PRESENTED BY
Redis + Spark Structured Streaming:
A Perfect Combination to Scale-out Your Continuous
Applications
Dave Nielsen
Redis Labs
PRESENTED BY
Agenda:
How to collect and process data streams in real time at scale
IoT
User Activity
Messages
PRESENTED BY
http://bit.ly/spark-redis
PRESENTED BY
http://bit.ly/spark-redis
PRESENTED BY
Breaking up Our Solution into Functional Blocks
Click data
1. Data Ingest: Record all clicks
2. Data Processing: Count clicks in real-time
3. Data Querying: Query clicks by asset
PRESENTED BY
The Actual Building Blocks of Our Solution
Click data
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
PRESENTED BY
1. Data Ingest
PRESENTED BY
Data Ingest using Redis Streams
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
PRESENTED BY
What is Redis Streams?
PRESENTED BY
Redis Streams in its Simplest Form
Producer → Consumer
PRESENTED BY
Redis Streams can Connect Many Producers and Consumers
Producer 2
Producer m
Producer 1
Producer 3
Consumer 1
Consumer n
Consumer 2
Consumer 3
PRESENTED BY
Comparing Redis Streams with Redis Pub/Sub, Lists, Sorted Sets
Pub/Sub
• Fire and forget
• No persistence
• No lookback queries
Lists
• Tight coupling between producers and consumers
• Persistence for transient data only
• No lookback queries
Sorted Sets
• Data ordering isn’t built-in; the producer controls the order
• No maximum limit
• The data structure is not designed to handle data streams
PRESENTED BY
What is Redis Streams?
• It is like Pub/Sub, but with persistence
• It is like Lists, but decouples producers and consumers
• It is like Sorted Sets, but asynchronous
Plus:
• Lifecycle management of streaming data
• Built-in support for timeseries data
• A rich choice of options for consumers to read streaming and static data
• Super-fast lookback queries powered by radix trees
• Automatic eviction of data based on a configured upper limit
PRESENTED BY
Redis Streams Benefits
Producer → Redis Stream → Consumers (Analytics, Data Backup, Messaging)
It enables asynchronous data exchange between producers and consumers, and historical range queries
PRESENTED BY
Redis Streams Benefits
Producer → Redis Stream → Image Processors (scaled out)
Arrival Rate: 500/sec, Consumption Rate: 500/sec
With consumer groups, you can scale out and avoid backlogs
PRESENTED BY
Redis Streams Benefits
Simplify data collection, processing and distribution to support complex scenarios
Producers (1..m) append entries with XADD
A consumer group of classifiers (1..n) reads with XREADGROUP and acknowledges with XACK → deep learning-based classification
Other consumers read with XREAD → analytics, data backup, messaging
PRESENTED BY
Our Ingest Solution
Redis Stream
1. Data Ingest
Command
xadd clickstream * img [image_id]
Sample data
127.0.0.1:6379> xrange clickstream - +
1) 1) "1553536458910-0"
   2) 1) "img"
      2) "image_1"
2) 1) "1553536469080-0"
   2) 1) "img"
      2) "image_3"
3) 1) "1553536489620-0"
   2) 1) "img"
      2) "image_3"
...
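To make the ingest side concrete, here is a minimal producer sketch in Scala using Jedis (not part of the deck). It assumes Redis on localhost:6379 and reuses the clickstream key and img field from the command above; the object name and image id are placeholders.

import scala.collection.JavaConverters._
import redis.clients.jedis.{Jedis, StreamEntryID}

// Hypothetical producer: appends one click event, equivalent to
//   XADD clickstream * img image_1
object ClickProducer {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)
    try {
      val fields = Map("img" -> "image_1").asJava   // field name matches the Spark schema ("img")
      val id = jedis.xadd("clickstream", StreamEntryID.NEW_ENTRY, fields)
      println(s"Appended stream entry $id")
      // Jedis also offers xadd overloads that cap the stream length (MAXLEN),
      // matching the "automatic eviction" point earlier in the deck.
    } finally {
      jedis.close()
    }
  }
}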
PRESENTED BY
2. Data Processing
PRESENTED BY
Data Processing using Spark’s Structured Streaming
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
PRESENTED BY
What is Structured Streaming?
PRESENTED BY
“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.”
Definition
PRESENTED BY
How Does Structured Streaming Work?
Pipeline: Source (Data Stream) → Spark Structured Streaming (micro-batches as DataFrames/tables) → Output Sink
DataFrame Operations
Selection:     df.select("xyz").where("a > 10")
Filtering:     df.filter(_.a > 10).map(_.b)
Aggregation:   df.groupBy("xyz").count()
Windowing:     df.groupBy(
                 window($"timestamp", "10 minutes", "5 minutes"),
                 $"word"
               ).count()
Deduplication: df.dropDuplicates("guid")
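The demo in this deck uses a plain groupBy count, but the windowing operation above generalizes it. Below is a hedged sketch (not from the deck) that counts clicks per image over sliding 10-minute windows; the "ts" column is an assumption, here rows are simply stamped on arrival for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_timestamp, window}

// Sketch only: "clickstream" is the streaming DataFrame built later in the deck.
def windowedCounts(clickstream: DataFrame): DataFrame =
  clickstream
    .withColumn("ts", current_timestamp())
    .withWatermark("ts", "10 minutes")   // bound the state kept for late data
    .groupBy(window(col("ts"), "10 minutes", "5 minutes"), col("img"))
    .count()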
PRESENTED BY
Spark-Redis Library
Redis Streams as data source → ClickAnalyzer (Structured Stream Processing) → Redis Hash as data sink
• Developed using Scala
• Compatible with Spark 2.3 and higher
• Supports:
  • RDD
  • DataFrames
  • Structured Streaming
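To pull the library into your own project, a typical sbt dependency looks like the line below; the exact coordinates and version are an assumption (the demo uses a 2.4.0-SNAPSHOT jar), so verify against the spark-redis releases on Maven Central.

// build.sbt (coordinates are an assumption – verify before use)
libraryDependencies += "com.redislabs" % "spark-redis" % "2.4.0"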
PRESENTED BY
1. Connect to the Redis instance
2. Map Redis Stream to Structured Streaming schema
3. Create the query object
4. Run the query
Steps for Using Redis Streams as Data Source
PRESENTED BY
Redis Streams as Data Source
1. Connect to the Redis instance
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
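Note: the snippet above is trimmed to fit the slide; to compile it as-is you also need these standard Spark imports (not shown in the deck).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}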
PRESENTED BY
Redis Streams as Data Source
2. Map Redis Stream to Structured Streaming schema
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
xadd clickstream * img [image_id]
PRESENTED BY
Redis Streams as Data Source
3. Create the query object
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
PRESENTED BY
Redis Streams as Data Source
4. Run the query
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
val clickWriter: ClickForeachWriter = new ClickForeachWriter("localhost","6379")
val query = queryByImg.writeStream
.outputMode("update")
.foreach(clickWriter)
.start()
query.awaitTermination()
Custom output sink
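As a quick sanity check before wiring in the custom Redis sink, you could print the running counts with Spark's built-in console sink (not shown in the deck).

// Debug-only sketch: print each micro-batch's updated counts to stdout
val debugQuery = queryByImg.writeStream
  .outputMode("update")
  .format("console")
  .start()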
PRESENTED BY
How to Setup Redis as Output Sink
override def process(record: Row) = {
var img = record.getString(0);
var count = record.getLong(1);
if(jedis == null){
connect()
}
jedis.hset("clicks:"+img, "img", img)
jedis.hset("clicks:"+img, "count", count.toString)
}
Create a custom class extending ForeachWriter and override the method, process()
Save as a Hash with the structure:
clicks:[image] → img = [image], count = [count]

Example:
clicks:image_1001 → img = image_1001, count = 1029
clicks:image_1002 → img = image_1002, count = 392
...

Table: Clicks
img          count
image_1001   1029
image_1002   392
...          ...
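For reference, here is a minimal end-to-end sketch of the writer class. The deck shows only process(); the constructor, connect(), open() and close() below are assumptions based on the ClickForeachWriter("localhost","6379") call on the previous slide and on Spark's ForeachWriter contract.

import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

// Hedged sketch: only process() appears in the deck; the rest is filled in.
class ClickForeachWriter(host: String, port: String) extends ForeachWriter[Row] {

  private var jedis: Jedis = _

  private def connect(): Unit = {
    jedis = new Jedis(host, port.toInt)
  }

  // Called once per partition/epoch; return true to process its rows.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    connect()
    true
  }

  // Each aggregated row (img, count) becomes a Redis Hash: clicks:[image]
  override def process(record: Row): Unit = {
    val img = record.getString(0)
    val count = record.getLong(1)
    if (jedis == null) connect()
    jedis.hset("clicks:" + img, "img", img)
    jedis.hset("clicks:" + img, "count", count.toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}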
PRESENTED BY
3. Data Querying
PRESENTED BY
Query Redis using Spark SQL
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
PRESENTED BY
1. Initialize Spark Context with Redis
2. Create table
3. Run Query
Steps to Query Redis using Spark SQL
Redis Hash to SQL mapping
clicks:image_1001 → img = image_1001, count = 1029
clicks:image_1002 → img = image_1002, count = 392
...

SQL table:
img          count
image_1001   1029
image_1002   392
...          ...
PRESENTED BY
How to Query Redis using Spark SQL

1. Initialize
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder().appName("redis-test").master("local[*]").config("spark.redis.host","localhost").config("spark.redis.port","6379").getOrCreate()
scala> val sc = spark.sparkContext
scala> import spark.sql
scala> import spark.implicits._

2. Create table
scala> sql("CREATE TABLE IF NOT EXISTS clicks(img STRING, count INT) USING org.apache.spark.sql.redis OPTIONS (table 'clicks')")
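As an alternative to registering a SQL table, the same hashes can be read straight into a DataFrame with the spark-redis data source. This is a sketch assuming the library's table option; values come back as strings unless you supply a schema or cast the columns.

// Sketch: read all hashes stored under clicks:* as a DataFrame
val clicksDf = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "clicks")
  .load()

clicksDf.show()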
PRESENTED BY
3. Run Query
scala> sql("select * from clicks").show();
+----------+-----+
| img|count|
+----------+-----+
|image_1001| 1029|
|image_1002| 392|
|. | .|
|. | .|
|. | .|
|. | .|
+----------+-----+
How to Query Redis using Spark SQL
PRESENTED BY
Code
Email dave@redislabs.com for this slide deck
Or download from https://github.com/redislabsdemo/
PRESENTED BY
Recap
PRESENTED BY
PRESENTED BY
Building Blocks of our Solution
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
Spark-Redis Library is used for: Redis Streams as data source; Redis as data sink
PRESENTED BY
Questions?
Thank you!
dave@redislabs.com
@davenielsen
Dave Nielsen
Editor's Notes
  • #3: Agenda: Takeaway: Use Redis Streams + Spark-Redis + Structured Streaming with micro-batches in Spark to collect and process data streams in real-time at scale
  • #4: Call the spark engine with spark-submit in scala directory: spark-submit --class com.redislabs.streaming.ClickAnalysis --jars ./lib/spark-redis-2.4.0-SNAPSHOT-jar-with-dependencies.jar --master local[*] ./target/scala-2.11/redisexample_2.11-1.0.jar 4 datastructures clickstream – redis stream data structure to collect all clicks $ XLEN clickstream Spark 2.3 introduced Structured streaming that does microbatches – collect from streams, such as Redis Streams or Kafka A few messages every few milliseconds. Then Spark runs queries to aggregate and collecting and storing somewhere, such as Redis Trick – Any hash data structure that starts with clicks: belongs to a table called Clicks $ HGETALL clicks:image_1 Fields called image and count are columns in the table Configure Spark SQL so it knows that any field that starts with clicks belongs to a table called Clicks $ HMSET clicks:image_test img test count 10000
  • #5: Summarize demo
  • #6: Read then go to next slide
  • #7: What are the functional blocks: Data ingest – collect all clicks without losing any – Redis Cloud w/Streams (free up to 30 MB), Spark-Redis into Spark. Process data in real-time – Spark w/Structured Streaming for micro-batches. Data querying – some kind of custom chart, leaderboard, or Grafana, using Spark SQL with Redis Cloud again
  • #11: Cover at high level Connects producers with consumers. May have many of either
  • #12: Redis Streams supports both asynchronous communication and look-back queries
  • #13: How many have used pub/sub, lists or sorted sets? Pub/Sub – no lookback queries, all asynchronous. Lists – one list cannot support many consumers. Sorted Sets – solve that problem (no need for a copy per consumer), but you always have to poll for data; you can use a blocking call, but that turns it back into a list. For streaming you have to poll.
  • #14: Redis Streams manages the life cycle of the streaming data effectively (example: consumer groups and their commands XREADGROUP, XACK and XCLAIM ensure every data object is consumed properly). It offers consumers a rich choice of options for consuming the data – they can read from where they left off, only the new data, or from the beginning. The lookback queries are super fast as they are powered by radix trees. Kafka or Kinesis have a timeframe limit; Redis has no timeframe, you can cap by max length / size
  • #15: If you have different types of consumers …
  • #16: Ex: toll booth – can back up – but we can match rate of arrival with rate of departure
  • #18: So this is our stream example
  • #19: clickstream is the stream key. Let Redis create the timestamp. img is the field. So that’s how data ingest works. Any questions?
  • #20: Stop and run query to see latest count scala> sql("select * from clicks").show(); $ DEL clicks:image_test
  • #23: Marketing definition from databricks
  • #24: With Structured Streaming, Spark pulls data in micro-batches (like a table). Every micro-batch has rows, and you can run DataFrame operations on these micro-batches. Windowing – can aggregate over the last 30 mins. Can also dedupe. Go to the Databricks website to see more. I’m doing aggregations in my demo. Spark used to only do batches; now in 2.4 micro-batches take milliseconds, and continuous processing is available in experimental mode. Finally you can define an output sink, or output to the console; 3 output modes: complete (dump everything), append only, or update
  • #25: Redis Labs developed and supports this open source library: data source and data sink for Redis Streams, dump data into Redis, query Redis from Spark. All written in Scala
  • #26: How do you use Redis as a data source? There are 4 steps: connect to the Redis db, map Redis Streams key-value pairs to the micro-batch schema, create the query object, run the query in a loop
  • #27: Connect to Redis Cloud. Move to next slide
  • #28: 2. Interpreting the stream data Clickstream is key name Img is field name
  • #29: 3. Defining the query Group by img Count
  • #30: 4. Dumping into a ForEachWriter - Custom – see next slide -
  • #32: Stop and run query to see latest count scala> sql("select * from clicks").show();
  • #33: Any questions? How do you query redis? Can connect ODBC drivers to Spark and can query Redis?
  • #34: How to connect to Redis How to map to Redis? Run Query
  • #35: 1. 2. Create a table (do it only once). Tell it to use class name org.apache.spark.sql.redis Map to table ‘clicks’
  • #36: 3. Run query