In-Memory Data Streams
With Hazelcast Jet
NEIL STEVENSON
neil@hazelcast.com
27th May 2017
13:25-14:10
© 2017 Hazelcast Inc. Confidential & Proprietary
Outline
• Hazelcast
• → The company, the software, and my role
• Background
• → Why stream at all ?
• Java 8 streams
• → What did Java 8 add to Java 7
• → Why isn’t this good enough ?
• Hazelcast Jet, part #1
• → Introduction and outline architecture
• → Low level abstractions : directed acyclic graphs
• A sample application, available to download : not Word Count
• Hazelcast Jet, part #2
• → Higher level abstractions → distributed java.util.stream
Hazelcast : The company, the software and my role
The Company
Founded in 2008, based out of Palo Alto, California with offices worldwide
Provides commercial support and value-add subscription features for open source Hazelcast software
The Software
Apache 2 licensed, available to download from GitHub, from https://hazelcast.org or
https://hazelcast.com
My Role
Solutions Architect – help customers, give talks, drink coffee, write code, drink coffee
Part 1 – Fast Big Data
DAG = Directed Acyclic Graph
Model the flow of data from processing stage to processing stage
→ a stream of data, potentially infinite
→ process as it comes in, don’t save first, maybe never save
→ enrich, deplete, filter, split, etc as data passes through
→ at memory speeds, no waiting for disks
Part 1 – Fast Big Data
Stream and Fast In-Memory Batch Processing
[Diagram: Ingest (Stream or Batch) from IoT, Social Networks, Enterprise Applications, Databases/Hazelcast IMDG and HDFS/Spark; Enrichment from Databases; Output to Alerts, Enterprise Applications, Interactive Analytics and Databases/Hazelcast IMDG]
Jet : Directed Acyclic Graph
VERTEX
The vertex is just the processing node in a pipeline.
→ Input comes in from somewhere, the first stage or the previous stage
→ Output goes out somewhere, the last stage or the next stage
→ Stateless or stateful
→ Split, filter, enrich, deplete, fan-out, fan-in the data, many possibilities
Jet : Directed Acyclic Graph
EDGE
The edge is just the data transmission in the pipeline.
→ Out of one processor into the next one
→ Out of one processor into the next ones
→ The next processor can be on any JVM, local or distributed routing
→ Back-pressure system throttles producer when consumer cannot keep up
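Back-pressure can be illustrated with a bounded queue standing in for an edge. This is a minimal sketch in plain Java, not Jet's implementation; the class name, method name and capacity are made up for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BackPressureSketch {

    // Returns how many items the producer managed to hand over before the
    // bounded "edge" filled up because the consumer was not keeping up.
    static int offerUntilFull(BlockingQueue<Integer> edge, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            // offer() refuses instead of blocking when the queue is full;
            // a real producer would back off and retry later.
            if (edge.offer(i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> edge = new ArrayBlockingQueue<>(3); // capacity 3 (arbitrary)
        System.out.println(offerUntilFull(edge, 10)); // 3: the consumer is idle
        edge.poll();                                  // the consumer takes one item
        System.out.println(offerUntilFull(edge, 10)); // 1: one slot was freed
    }
}
```

The producer can only run ahead of the consumer by the capacity of the queue, which is the throttling effect the slide describes.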
Part 1 – Jet Engine
Stream Processing
Traditional processing is based on calculations on stored data
Stream processing is about calculations prior to storage
Streams are immutable
Streams may be infinite
The “pipeline” paradigm (input → process → output)
Pipeline stages are lambdas : (x, y) -> {return x * y;}
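The points above can be tried out with plain Java 8 streams, before Jet enters the picture. A minimal sketch follows; the class and method names are illustrative only. The source is infinite, each stage is a lambda, and nothing is stored before processing.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InfinitePipelineSketch {

    // input -> process -> output over a potentially infinite source.
    static List<Integer> firstSquaresOfEvens(int howMany) {
        return Stream.iterate(1, x -> x + 1)   // infinite source: 1, 2, 3, ...
                .filter(x -> x % 2 == 0)       // stage 1: keep the even numbers
                .map(x -> x * x)               // stage 2: a lambda transforms each item
                .limit(howMany)                // stop early, otherwise this never ends
                .collect(Collectors.toList()); // output
    }

    public static void main(String[] args) {
        System.out.println(firstSquaresOfEvens(4)); // [4, 16, 36, 64]
    }
}
```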
Part 1 – Jet Engine
What does it do ?
Stream Processing
In-memory
Distributed
Example 1 : Word Count
Word Count is the “hello world” of stream processing:
The Problem
→ Count how many times each word occurs in some text
→ Trivial, but shows some major concepts
Input
→ Hamlet’s Soliloquy
1: To be, or not to be, that is the Question:
2: Whether ’tis Nobler in the mind to suffer
3: The Slings and Arrows of outragious Fortune,
4: Or to take Armes against a Sea of troubles,
Output
the=23
to=14
and=13
be=4
…
Example 1 : Word Count
Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet();
Map<String, Integer> wordCounts = entrySet.stream()
    .flatMap(m ->
        Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toMap(
        key -> key,
        value -> 1,
        Integer::sum));
In Java we would basically iterate and tally
How can the JVM optimise?
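For comparison, "iterate and tally" in the older loop style might look like the sketch below. The regular expression is an assumption standing in for Constants.WORDS_PATTERN, whose definition the slides do not show, and the length filter from the slide is left out to keep the tally easy to check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class IterateAndTally {

    // Assumption: a stand-in for Constants.WORDS_PATTERN (not shown in the slides).
    static final Pattern WORDS = Pattern.compile("\\W+");

    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : WORDS.split(line.toLowerCase())) {
                if (word.isEmpty()) {
                    continue; // split() can yield a leading empty token
                }
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("To be, or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

Everything happens in one thread, one line after another, which is what the following slides set out to improve.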
Example 1 : Word Count
Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output
Tokenizer : split the text into words; for each word emit (word)
Reducer : collect running totals; once everything is finished, emit all pairs of (word, count)
But really this is just a pipeline, so a DAG
Example 1 : Word Count
Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output
Tokenizer : split the text into words; for each word emit (word)
Reducer : collect running totals; once everything is finished, emit all pairs of (word, count)
Using queues between vertices allows each to run in parallel, at their own speed
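The queue idea can be sketched with two threads and a BlockingQueue between them. This is an illustration in plain Java rather than Jet code; the class name, the queue capacity and the word-splitting pattern are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class QueuedPipelineSketch {

    // End-of-input marker; safe because the tokenizer never emits empty words.
    private static final String DONE = "";

    static Map<String, Integer> run(List<String> lines) {
        BlockingQueue<String> edge = new ArrayBlockingQueue<>(16); // the "edge"
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Tokenizer vertex: emits one word at a time onto the edge.
            Future<Object> tokenizer = pool.submit(() -> {
                for (String line : lines) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            edge.put(word); // blocks if the reducer falls behind
                        }
                    }
                }
                edge.put(DONE);
                return null;
            });
            // Reducer vertex: keeps running totals until the input is finished.
            Future<Map<String, Integer>> reducer = pool.submit(() -> {
                Map<String, Integer> totals = new TreeMap<>();
                for (String w = edge.take(); !w.equals(DONE); w = edge.take()) {
                    totals.merge(w, 1, Integer::sum);
                }
                return totals;
            });
            tokenizer.get();
            return reducer.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("To be, or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```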
Example 1 : Word Count
[Diagram: Input → two Tokenizers in parallel → Reducer → (word, count) → Output]
We can exploit multiple CPUs because lines can be processed in parallel
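On a single machine, Java 8 can already do this with a parallel stream. A minimal sketch, assuming the text is held as a list of lines; the class name and splitting pattern are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class ParallelWordCount {

    // Each line can be tokenized independently, so the work spreads over CPUs.
    static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                // A concurrent collector lets all threads tally into one shared map.
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList(
                "To be, or not to be, that is the Question:",
                "Whether 'tis Nobler in the mind to suffer"));
        System.out.println(counts.get("to")); // 3
    }
}
```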
Example 1 : Word Count
[Diagram: Input → two Tokenizers → (word) → two Reducers → Output]
Use routing algorithms to select the next vertex or vertices
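One common routing rule is to hash the key, so that every occurrence of the same word reaches the same reducer. The sketch below shows that rule only; it is an assumption for illustration, not a description of Jet's internal partitioning.

```java
class PartitionRouterSketch {

    // The same key always maps to the same reducer index in [0, reducerCount).
    static int route(String key, int reducerCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), reducerCount);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"to", "be", "or", "not", "to", "be"}) {
            System.out.println(word + " -> reducer " + route(word, 2));
        }
    }
}
```

Because both occurrences of "to" land on one reducer, that reducer holds the complete count for "to" and no cross-reducer merge is needed for that word.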
Example 1 : Word Count
[Diagram: two Nodes, each running the same graph of Input, two Tokenizers, two Reducers, two Combiners and Output]
Distribute!!
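Once the work is spread over nodes, each reducer only has partial totals, and a combiner has to merge them. A minimal sketch of that merge step in plain Java; the class name and sample counts are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Merge the partial (word, count) totals produced by several reducers.
    static Map<String, Integer> combine(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> nodeA = new HashMap<>();
        nodeA.put("the", 2);
        nodeA.put("to", 1);
        Map<String, Integer> nodeB = new HashMap<>();
        nodeB.put("the", 1);
        nodeB.put("be", 4);
        System.out.println(combine(Arrays.asList(nodeA, nodeB)));
        // {be=4, the=3, to=1}
    }
}
```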
Example 2 : Foreign Currency
The Problem :
Time-series foreign exchange prices.
We want to compute moving averages in various ways
→ last n measurements, last 15, last 50, etc
Why ?
→ rapidly changing data
→ time-to-market benefits from fast processing
Why ?
→ gives a clearer view of the trend
Why ?
→ to demonstrate a different architecture pattern
→ processing a stream of data, don’t save first then analyse
→ partitioning a stream of data, for scaling
Example 2 : Foreign Currency
The Data
For convenience, we’re using end of day prices rather than live prices, so frequency is one
sample per 24x60x60x1000 milliseconds. And only for the Euro.
<gesmes:Sender>
<gesmes:name>European Central Bank</gesmes:name>
</gesmes:Sender>
<Cube>
<Cube time="2017-04-20">
<Cube currency="USD" rate="1.0745"/>
<Cube currency="JPY" rate="117.16"/>
<Cube currency="BGN" rate="1.9558"/>
<Cube currency="CZK" rate="26.907"/>
<Cube currency="DKK" rate="7.4381"/>
<Cube currency="GBP" rate="0.8392"/>
<Cube currency="HUF" rate="313.5"/>
<Cube currency="PLN" rate="4.2588"/>
<Cube currency="RON" rate="4.5405"/>
Example 2 : Foreign Currency
[Diagram: Input: FX feed → (from,to,price) → Last n Window]
One Solution
Input arrives as a stream of individual prices, e.g. “EUR,GBP,0.8392”
Collate these into a batch of n per pair
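The collation step can be sketched as a sliding window per currency pair. This is a plain Java illustration, not the LastN processor from the demo; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LastNWindowSketch {

    private final int n;
    private final Map<String, Deque<Double>> windows = new HashMap<>();

    LastNWindowSketch(int n) {
        this.n = n;
    }

    // Adds one price for a pair; returns the last n prices (oldest first)
    // once n have been seen, or an empty list while the window is filling.
    List<Double> add(String pair, double price) {
        Deque<Double> window = windows.computeIfAbsent(pair, k -> new ArrayDeque<>());
        window.addLast(price);
        if (window.size() > n) {
            window.removeFirst(); // slide: drop the oldest price
        }
        if (window.size() < n) {
            return Collections.emptyList();
        }
        return new ArrayList<>(window);
    }

    public static void main(String[] args) {
        LastNWindowSketch lastThree = new LastNWindowSketch(3);
        for (double price : new double[] {0.8392, 0.8391, 0.8390, 0.8389}) {
            System.out.println(lastThree.add("EUR,GBP", price));
        }
    }
}
```

State is kept per pair, so the window for one pair never sees prices for another.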
Example 2 : Foreign Currency
[Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average]
One Solution
Send a self-contained parcel of work to each calculator
A batch of n prices for a pair, e.g. “EUR,GBP,0.8392, 0.8391, 0.8390, 0.8389, …”
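Given such a self-contained batch, each calculator is a small pure function. The sketch below makes two assumptions that the slides do not state: the batch is ordered oldest first, and the weighted average uses linear weights 1..n so the newest price counts most.

```java
import java.util.Arrays;
import java.util.List;

class MovingAverageSketch {

    // Simple average: every price in the batch counts equally.
    static double simple(List<Double> batch) {
        double sum = 0;
        for (double price : batch) {
            sum += price;
        }
        return sum / batch.size();
    }

    // Weighted average with linear weights 1..n (an assumption): the batch is
    // taken to be oldest first, so the newest price gets the largest weight.
    static double weighted(List<Double> batch) {
        double weightedSum = 0;
        double weightTotal = 0;
        for (int i = 0; i < batch.size(); i++) {
            int weight = i + 1;
            weightedSum += weight * batch.get(i);
            weightTotal += weight;
        }
        return weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        List<Double> batch = Arrays.asList(1.0, 2.0, 3.0);
        System.out.println(simple(batch));   // 2.0
        System.out.println(weighted(batch)); // (1*1 + 2*2 + 3*3) / 6, about 2.33
    }
}
```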
Example 2 : Foreign Currency
[Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average → (from,to,average) → Output: Store A and Output: Store B]
One Solution
Stream out the averages…
Your output is someone else’s input
Example 2 : Foreign Currency
[Diagram: the FX feed is split by target currency. (from,USD,price) and (from,CAD,price) each go to their own Last n Window, which sends n * (from,USD,price) or n * (from,CAD,price) to its own Simple Average and Weighted Average; the resulting (from,USD,average) and (from,CAD,average) flow to Output: Store A and Output: Store B]
One Solution
Partitioning provides performance. Send US Dollars and Canadian Dollars to different processor
clones
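The effect of partitioning can be sketched with a plain Java grouping: prices are grouped by target currency and each group is averaged on its own, with no shared state between groups. The Price class and method names are hypothetical, and a real deployment would run the groups on different processors rather than in one stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PartitionByCurrencySketch {

    // Hypothetical value class for one (from, to, price) item.
    static final class Price {
        final String from;
        final String to;
        final double price;

        Price(String from, String to, double price) {
            this.from = from;
            this.to = to;
            this.price = price;
        }

        String getTo() {
            return to;
        }

        double getPrice() {
            return price;
        }
    }

    // The target currency is the partition key; each partition is averaged alone.
    static Map<String, Double> averageByCurrency(List<Price> feed) {
        return feed.stream().collect(Collectors.groupingBy(
                Price::getTo, TreeMap::new, Collectors.averagingDouble(Price::getPrice)));
    }

    public static void main(String[] args) {
        List<Price> feed = Arrays.asList(
                new Price("EUR", "USD", 1.0),
                new Price("EUR", "CAD", 4.0),
                new Price("EUR", "USD", 2.0));
        System.out.println(averageByCurrency(feed)); // {CAD=4.0, USD=1.5}
    }
}
```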
Example 2 : Foreign Currency
One Solution
DEMO
https://github.com/neilstevenson/jeeconf2017
Jet Engine
Jet capability is easy to add to IMDG
Two steps and you’re ready to submit jobs!
<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>0.3.1</version>
</dependency>
@Bean
public JetInstance jetInstance(Config config) {
    JetConfig jetConfig = new JetConfig();
    jetConfig.setHazelcastConfig(config);
    return Jet.newJetInstance(jetConfig);
}
Jet Engine
Jet capability is the processing, but what about the start and end of the pipelines ?
A source creates output without input.
A sink consumes input without output.
Where it goes is just a matter of plumbing
→ Hazelcast IMDG, IMap and IList
→ Kafka
→ HDFS
→ flat files
→ sockets
→ easy to write your own, they’re just vertices
implement process() to consume input
implement complete() to generate output
Jet Engine
DAG construction is easy(?)
Create vertices, and edges to link them
public MaDAG(final int last) {
    // Source : read the historic currency prices from a map
    Vertex mapSource = this.newVertex("mapSource",
        Processors.readMap(Constants.MAP_HISTORIC_CURRENCY));
    // Collate the last n prices, with the edge partitioned by key
    Vertex lastN = this.newVertex("lastN", new LastNProcessorSupplier(last));
    this.edge(Edge.between(mapSource, lastN).partitioned(new MaKeyExtractor()));
    // Calculate the average for each batch
    Vertex sma = this.newVertex("sma", SmaProcessor::new);
    this.edge(Edge.from(lastN, 0).to(sma));
    // Sink : write the averages to another map
    Vertex smaMapSink = this.newVertex("smaMapSink",
        Processors.writeMap(Constants.MAP_SMA));
    this.edge(Edge.between(sma, smaMapSink));
}
But is there an easier way ?
Jet Engine
java.util.stream
An easier(?) way to construct a pipeline
Change from Java 8
Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet();
Map<String, Integer> wordCounts = entrySet.stream()
    .flatMap(m ->
        Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toMap(
        key -> key,
        value -> 1,
        Integer::sum));
Jet Engine
com.hazelcast.jet.stream
An easier(?) way to construct a pipeline
Change to Jet
IStreamMap<Integer, String> streamMap = IStreamMap.streamMap(sourceMap);
IMap<String, Integer> wordCounts = streamMap.stream()
    .flatMap(m ->
        Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toIMap(
        key -> key,
        value -> 1,
        Integer::sum));
More thinking than typing
Jet Engine
DAG v java.util.stream
Jet provides a java.util.stream interface – high-level constructs
like Java 8’s collect(), distinct(), filter(), reduce(), sorted() etc
but run distributed
Or use the DAG approach, for low-level, fine-grained control
Or mix & match
Vertex tokenize = dag.newVertex("tokenize",
flatMap((String line) ->
traverseArray(delimiter.split(line.toLowerCase()))
.filter(word -> !word.isEmpty())));
Jet Engine
DAG v java.util.stream
Vertex tokenize = dag.newVertex("tokenize",
flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
.filter(word -> !word.isEmpty())));
Here filter implements
java.util.stream.Stream<T>
java.util.stream.Stream.filter(Predicate<? super T> predicate)
But the Jet version is
com.hazelcast.jet.stream.DistributedStream<T>
com.hazelcast.jet.stream.DistributedStream.filter(
com.hazelcast.jet.Distributed.Predicate<? super T> predicate)
So you can send copies to the grid to execute, remotely and in parallel
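What makes "sending copies" possible is that the lambda can be serialized. The sketch below uses a hypothetical SerializablePredicate interface as a stand-in, on the assumption that Jet's distributed predicate type is likewise a Predicate that is also Serializable. It round-trips a lambda through bytes inside one JVM, as a stand-in for shipping it to another member.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Predicate;

class SerializablePredicateSketch {

    // Hypothetical stand-in: a Predicate that may also be serialized.
    interface SerializablePredicate<T> extends Predicate<T>, Serializable {
    }

    static SerializablePredicate<String> nonEmpty() {
        return word -> !word.isEmpty();
    }

    static byte[] toBytes(Object object) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The copy is what a remote member would receive and execute.
        SerializablePredicate<String> copy = fromBytes(toBytes(nonEmpty()));
        System.out.println(copy.test("hamlet")); // true
        System.out.println(copy.test(""));       // false
    }
}
```

A plain java.util.function.Predicate lambda gives no such guarantee, since it is only serializable when its target type is.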
Jet Engine
Architecture
Jet Roadmap
Features : Description
Robust Stream Processing : Processing guarantees for stream processing | Streaming specific features (windowing, triggering)
High Performance
Hazelcast Integrations : JCache | Map and Cache events using partition ring buffer | CQ Cache | Projection and Predicate for Map source
Management Center : Management and monitoring features for Jet
More Connectors : JMS | JDBC
Cloud Deployment : Pivotal Cloud Foundry | OpenShift
Part 1 – Jet Engine
Performance
Fastest in town!
Part 1 – Jet Engine
Performance
Run the graph on as many machines as necessary or available
→ Fan-out the input
→ Send from node to node, local or distributed
→ Fan-in the output
Conclusions
Stream Processing
• Suitable when data arrives too fast to process after storing, or where you don’t care to store
• Needs a much more functional programming style than traditional Java
• → lambdas feature heavily
• Java streams is ok, might be all you need
• → makes good use of a single machine
• Jet streams is better, for bigger volumes
• → makes use of multiple machines
• Jet is from Hazelcast
• → easy to get going, deploy to bare metal or any cloud
• Alternatives exist, such as Spark and Flink
• → Jet is open-source, Java, faster, no ZooKeeper
The End
https://github.com/neilstevenson/jeeconf2017
neil@hazelcast.com
https://jet.hazelcast.org/
https://github.com/hazelcast/hazelcast-jet
Stack Overflow “hazelcast-jet” or Google Group
https://gitter.im/hazelcast/home

JEEConf 2017 - In-Memory Data Streams With Hazelcast Jet
