@gamussa @hazelcast #oraclecode
IN-MEMORY ANALYTICS
with APACHE SPARK and
HAZELCAST
Solutions Architect
Developer Advocate
@gamussa in internetz
Please, follow me on Twitter
I’m very interesting ©
Who am I?
What’s Apache Spark?
Lightning-Fast Cluster Computing
Run programs up to 100x
faster than Hadoop
MapReduce in memory,
or 10x faster on disk.
When to use Spark?
Data Science Tasks
when questions are unknown
Data Processing Tasks
when you have too much data
You’re tired of Hadoop
Spark Architecture
RDD
Resilient Distributed Datasets (RDD)
are the primary abstraction in Spark –
a fault-tolerant collection of elements that can be
operated on in parallel
RDD Operations
operations on RDDs:
transformations and actions
transformations are lazy
(not computed immediately)
by default, the transformed RDD is recomputed
each time an action is run on it
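A minimal sketch of the lazy-then-recompute behaviour, using plain `java.util.stream` as a single-JVM stand-in rather than Spark's own API. The class name `LazyPipelineSketch` and the call counter are made up for illustration. The stream's `map` plays the role of a transformation and the terminal operations play the role of actions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Spark or Hazelcast.
class LazyPipelineSketch {
    // Counts how many times the "transformation" function actually runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Returns a recipe for the pipeline. Building it does no work on the data.
    static Supplier<Stream<Integer>> squares(List<Integer> source) {
        return () -> source.stream().map(x -> {
            mapCalls.incrementAndGet();
            return x * x;
        });
    }

    public static void main(String[] args) {
        Supplier<Stream<Integer>> pipeline = squares(Arrays.asList(1, 2, 3, 4));
        System.out.println("calls after defining the pipeline: " + mapCalls.get());

        // First "action": the map function runs now, once per element.
        int sum = pipeline.get().mapToInt(Integer::intValue).sum();
        System.out.println("sum = " + sum + ", calls = " + mapCalls.get());

        // Second "action": nothing was cached, so the map function runs again.
        List<Integer> all = pipeline.get().collect(Collectors.toList());
        System.out.println("all = " + all + ", calls = " + mapCalls.get());
    }
}
```

The counter stays at zero until the first terminal operation, then grows by the number of elements on every subsequent one, which mirrors the default recomputation described on the slide.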
RDD
Transformations
RDD
Actions
RDD
Fault Tolerance
RDD
Construction
parallelized collections
take an existing Scala collection
and run functions on it in parallel
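As a rough single-JVM analogue of a parallelized collection, the sketch below takes an existing Java list and applies a function to its elements in parallel with `parallelStream`. This is not Spark code: in Spark the same idea goes through the context's `parallelize` method and spreads the work across a cluster. The class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; uses only the JDK, not Spark.
class ParallelCollectionSketch {
    // Square every element in parallel and combine the partial results.
    static int sumOfSquares(List<Integer> numbers) {
        return numbers.parallelStream()
                .mapToInt(x -> x * x)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3, 4, 5)));
    }
}
```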
Hadoop datasets
run functions on each record of a file in Hadoop distributed
file system or any other storage system supported by
Hadoop
What’s Hazelcast IMDG?
The Fastest In-memory Data Grid
Hazelcast IMDG
is an operational,
in-memory,
distributed computing platform
that manages data using
in-memory storage, and
performs parallel execution for
breakthrough application speed
and scale
High-Density Caching
In-Memory Data Grid
Web Session Clustering
Microservices Infrastructure
What’s Hazelcast IMDG?
In-memory Data Grid
Apache v2 Licensed
Distributed
Caches (IMap, JCache)
Java Collections (IList, ISet, IQueue)
Messaging (Topic, RingBuffer)
Computation (ExecutorService, M-R)
[Diagram: primary, backup, shard]
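A simplified sketch of how a key could map to a shard (partition) with one primary and one backup owner. Everything here is an assumption for illustration: the class `PartitionSketch`, the use of `hashCode()`, and the "backup lives on the next member" rule. Hazelcast's own hashing and placement logic differ. The count of 271 matches Hazelcast's documented default partition count.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not Hazelcast's actual partitioning code.
class PartitionSketch {
    // Assumed fixed number of partitions (271 is Hazelcast's default).
    static final int PARTITION_COUNT = 271;

    // Map a key to a partition; floorMod keeps the result non-negative.
    static int partitionOf(Object key) {
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    // Assumed placement rule: the primary copy lives on one member...
    static String primaryOwner(int partition, List<String> members) {
        return members.get(partition % members.size());
    }

    // ...and the backup copy lives on the next member in the list.
    static String backupOwner(int partition, List<String> members) {
        return members.get((partition + 1) % members.size());
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("member-1", "member-2", "member-3");
        int partition = partitionOf("movie-42");
        System.out.println("partition " + partition
                + " primary=" + primaryOwner(partition, members)
                + " backup=" + backupOwner(partition, members));
    }
}
```

With at least two members, the primary and the backup of a partition never land on the same member, so losing one member does not lose the shard.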
final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");
final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
Demo
LIMITATIONS
DATA SHOULD NOT BE
UPDATED WHILE READING
FROM SPARK
WHY ?
MAP EXPANSION
SHUFFLES THE DATA
INSIDE THE BUCKET
CURSOR DOESN’T POINT TO
CORRECT ENTRY ANYMORE,
DUPLICATE OR MISSING
ENTRIES COULD OCCUR
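The effect can be reproduced with a toy hash table. This is a sketch under stated assumptions, not Hazelcast's storage code: keys are bucketed by `key mod capacity`, the reader's cursor is a plain offset into the bucket-by-bucket iteration order, and a single insert between two read batches doubles the capacity. All names (`CursorDriftSketch`, `readBatch`, and so on) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; a toy model, not Hazelcast internals.
class CursorDriftSketch {
    // Build the table: bucket i holds the keys with key mod capacity == i.
    static List<List<Integer>> buckets(List<Integer> keys, int capacity) {
        List<List<Integer>> table = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            table.add(new ArrayList<>());
        }
        for (int key : keys) {
            table.get(Math.floorMod(key, capacity)).add(key);
        }
        return table;
    }

    // Entries are visited bucket by bucket; the cursor is an offset into this order.
    static List<Integer> iterationOrder(List<List<Integer>> table) {
        List<Integer> order = new ArrayList<>();
        for (List<Integer> bucket : table) {
            order.addAll(bucket);
        }
        return order;
    }

    // Read up to batchSize entries starting at the cursor position.
    static List<Integer> readBatch(List<List<Integer>> table, int cursor, int batchSize) {
        List<Integer> order = iterationOrder(table);
        int from = Math.min(cursor, order.size());
        int to = (int) Math.min((long) from + batchSize, order.size());
        return new ArrayList<>(order.subList(from, to));
    }

    // Scan keys 0..7 in two batches while a writer inserts key 8 in between.
    static List<Integer> scanWithConcurrentInsert() {
        List<Integer> keys = new ArrayList<>(Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7));
        List<List<Integer>> table = buckets(keys, 4);

        List<Integer> seen = new ArrayList<>(readBatch(table, 0, 4));
        int cursor = seen.size();

        // The insert triggers an expansion from 4 to 8 buckets and a rehash.
        keys.add(8);
        table = buckets(keys, 8);

        // The reader resumes with a cursor that belonged to the old layout.
        seen.addAll(readBatch(table, cursor, 100));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(scanWithConcurrentInsert());
    }
}
```

In this toy model the scan returns `[0, 4, 1, 5, 3, 4, 5, 6, 7]`: keys 4 and 5 appear twice, and keys 2 and 6 are never seen, because the old cursor offset now points into a reshuffled iteration order.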
github.com/hazelcast/hazelcast-spark
THANKS!
Any questions?
You can find me at
@gamussa
viktor@hazelcast.com

[OracleCode SF] In memory analytics with apache spark and hazelcast