Stream, Stream, Stream:
Different Streaming methods with Spark and Kafka
Itai Yaffe
Nielsen
Introduction
Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data
challenges since 2012
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Attended our session yesterday about counting
unique users with Druid?
● Working with Spark/Kafka? Planning to?
Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?
Nielsen Marketing Cloud - high-level architecture
Data flow in the old days...
In-DB aggregation
OLAP
Data flow in the old days… What’s wrong with that?
● CSV-related issues, e.g:
○ Truncated lines in input files
○ Can’t enforce schema
● Scale-related issues, e.g:
○ Had to “manually” scale the processes
That's one small step for [a] man… (2014)
“Apache Spark is the Taylor Swift of big data software" (Derrick Harris, Fortune.com, 2015)
In-DB aggregation
OLAP
Why just a small step?
● Solved the scaling issues
● Still faced the CSV-related issues
Data flow - the modern way
Spark + Kafka (Photography Copyright: NBC)
Read messages → In-DB aggregation → OLAP
The need for stateful streaming
Fast forward a few months...
● New requirements were being raised
● Specific use-case:
○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app
Stateful streaming via “local” aggregations
1. Read messages (from Kafka)
2. Aggregate the current micro-batch
3. Write the combined aggregated data (to HDFS)
4. Read the aggregated data from HDFS every X micro-batches
5. Upsert the aggregated data into the OLAP store (every X micro-batches)
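A minimal sketch of this pattern in Scala, assuming a Kafka direct stream already mapped to (key, count) pairs; the HDFS path, the upsert frequency and the upsertToDb/swapPaths/clearPath helpers are hypothetical stand-ins for what the real app did:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helpers (stubs here): the real app issued an INSERT ... ON DUPLICATE KEY UPDATE
// against the operational DB and moved/deleted HDFS directories.
def upsertToDb(df: DataFrame): Unit = ???
def swapPaths(from: String, to: String): Unit = ???
def clearPath(path: String): Unit = ???

def runLocalAggregation(stream: DStream[(String, Long)], spark: SparkSession,
                        aggregatedPath: String, upsertEvery: Int): Unit = {
  import spark.implicits._
  var microBatchCount = 0

  stream.foreachRDD { rdd =>
    // 2. Aggregate the current micro-batch
    val current = rdd.reduceByKey(_ + _).toDF("key", "count")

    // 3 + 4. Combine with the aggregated data of previous micro-batches, stored on HDFS
    val combined =
      try spark.read.parquet(aggregatedPath).union(current)
            .groupBy("key").agg(sum("count").as("count"))
      catch { case _: org.apache.spark.sql.AnalysisException => current } // first run: nothing on HDFS yet

    // Write to a fresh path and swap, to avoid overwriting data we are still reading from
    combined.write.mode("overwrite").parquet(aggregatedPath + ".next")
    swapPaths(aggregatedPath + ".next", aggregatedPath)

    // 5. Every X micro-batches, upsert the combined state into the DB and reset it
    microBatchCount += 1
    if (microBatchCount % upsertEvery == 0) {
      upsertToDb(spark.read.parquet(aggregatedPath))
      clearPath(aggregatedPath)
    }
  }
}
```

The try/catch and the path swapping already hint at why the next slide calls this approach error-prone.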
Stateful streaming via “local” aggregations
● Required us to manage the state on our own
● Error-prone
○ E.g what if my cluster is terminated and data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
● Enables running continuous, incremental processes
○ Basically manages the state for you
● Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
● Many other features
● Was in ALPHA mode in 2.0 and 2.1
Structured Streaming
Structured Streaming - stateful app use-case
1. Read messages (from Kafka)
2. Aggregate the current window
3. Checkpoint (offsets and state) handled internally by Spark
4. Upsert aggregated data into the OLAP store (on window end)
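A minimal sketch of this flow in Scala, assuming an illustrative broker, topic and message schema; the checkpoint location (step 3) is just a query option, and UpsertWriter is a hypothetical ForeachWriter that would issue the DB upsert:

```scala
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.{col, count, from_json, window}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder.appName("stateful-structured-streaming-sketch").getOrCreate()

// Hypothetical message schema, for illustration only
val schema = new StructType()
  .add("userId", StringType)
  .add("campaignId", StringType)
  .add("eventTime", TimestampType)

// 1. Read messages from Kafka
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

// 2. Aggregate the current window; Spark keeps the state in the checkpoint (step 3)
val aggregated = events
  .withWatermark("eventTime", "30 minutes")
  .groupBy(window(col("eventTime"), "1 hour"), col("campaignId"))
  .agg(count("userId").as("impressions"))

// 4. Upsert each aggregated row; the upsert makes replayed rows harmless.
// UpsertWriter is a hypothetical writer issuing INSERT ... ON DUPLICATE KEY UPDATE.
class UpsertWriter extends ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true // open a DB connection here
  def process(row: Row): Unit = ()                           // issue the upsert for this row
  def close(errorOrNull: Throwable): Unit = ()               // close the connection
}

val query = aggregated.writeStream
  .outputMode("update") // emits updated aggregates each trigger
  .option("checkpointLocation", "s3://my-bucket/checkpoints/stateful-app/") // placeholder
  .foreach(new UpsertWriter)
  .start()

query.awaitTermination()
```

With Append output mode instead of Update, rows would only be emitted once the watermark closes a window, which matches "on window end" more literally.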
Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1) :
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Checkpointing to S3 wasn’t straightforward (see the snippet after this list)
○ Tried using EMRFS consistent view
■ Worked for stateless apps
■ Encountered sporadic issues for stateful apps
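For reference, the checkpoint location is configured per query; a stateless sketch with placeholder S3 paths, assuming the streaming `events` DataFrame from the earlier sketch:

```scala
// Placeholder paths; the checkpoint dir holds the Kafka offsets already read and,
// for stateful queries, the aggregation state
val fileQuery = events.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/output/")                           // placeholder
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my-app/") // placeholder
  .start()
```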
Structured Streaming - strengths and weaknesses (IMO)
● Strengths include :
○ Running incremental, continuous processing
○ Increased performance (e.g via Catalyst SQL optimizer)
○ Massive efforts are invested in it
● Weaknesses were mostly related to maturity
Back to the future - Spark Streaming revived for “stateful” app use-case
1. Read messages (from Kafka)
2. Aggregate the current micro-batch
3. Write files (to S3)
4. Load data into the OLAP store
Cool, so… Why can’t we stop here?
● Significantly underutilized cluster resources = wasted $$$
Cool, so… Why can’t we stop here? (cont.)
● Extreme load of Kafka brokers’ disks
○ Each micro-batch needs to read ~300M messages, Kafka can’t store it all in memory
● ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
○ Forced us to use 1 core per executor to avoid it (see the config sketch after this list)
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving
https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it even less frequently
○ Remember - longer micro-batches result in a better aggregation ratio
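A hedged sketch of that workaround in Scala; the instance count and batch interval are placeholders, and the same setting can of course be passed via spark-submit (--executor-cores 1) instead:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Workaround sketch: one core per executor, so the (non-thread-safe) Kafka 0.10 consumer
// is never used by two tasks at once.
val conf = new SparkConf()
  .setAppName("spark-streaming-kafka-app")
  .set("spark.executor.cores", "1")       // the actual workaround
  .set("spark.executor.instances", "30")  // placeholder: compensate with more executors

val ssc = new StreamingContext(conf, Seconds(300)) // placeholder: long micro-batches for a better aggregation ratio
```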
Enter “streaming” over RDR
RDR (or Raw Data Repository) is our Data Lake
● Kafka topic messages are stored on S3 in Parquet format
● RDR Loaders - stateless Spark Streaming applications
● Applications can read data from RDR for various use-cases
○ E.g analyzing data of the last 30 days
Can we leverage our Data Lake and use it as the data source (instead of Kafka)?
How do we “stream” RDR files - producer side
RDR Loaders (writing to S3 RDR):
1. Read messages (from Kafka)
2. Write files (to S3 RDR)
3. Write the files’ paths to a dedicated Kafka topic (the files’ paths are the messages)
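A sketch of the producer side in Scala, invoked once per micro-batch from inside an RDR loader; the bucket layout, topic name and broker address are illustrative, and a real loader would reuse a long-lived producer:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.DataFrame

val pathsTopic = "rdr-file-paths" // hypothetical "paths" topic

def writeMicroBatchToRdr(batch: DataFrame, batchTime: Long): Unit = {
  // 2. Write the micro-batch to RDR (S3) as Parquet, partitioned by date
  val outputPath = s"s3://rdr-bucket/some-topic/dt=2019-05-23/batch=$batchTime" // placeholder layout
  batch.write.mode("append").parquet(outputPath)

  // 3. Publish the written path to the dedicated topic, so batch consumers can pick it up
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092") // placeholder
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord[String, String](pathsTopic, outputPath))
  producer.close()
}
```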
How do we “stream” RDR files - consumer side
1. Read the files’ paths (from the dedicated Kafka topic)
2. Read the RDR files (from S3)
3. Process the files
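A sketch of the consumer side in Scala, following the flow above: read the next file paths with the plain Kafka consumer API (from the driver), read those Parquet files as a regular Spark batch, and commit the offsets only after processing succeeds. Broker, topic, group id and the `process` step are placeholders, and `poll(Duration)` assumes a recent Kafka client:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("rdr-streaming-consumer-sketch").getOrCreate()

// Hypothetical processing step standing in for the real aggregation logic
def process(df: DataFrame): Unit = ???

val props = new Properties()
props.put("bootstrap.servers", "broker:9092") // placeholder
props.put("group.id", "rdr-consumer-app")     // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("enable.auto.commit", "false")      // commit offsets only after processing succeeds

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("rdr-file-paths")) // hypothetical "paths" topic

// 1. Read the next batch of file paths from the designated topic
val paths = consumer.poll(Duration.ofSeconds(30)).asScala.map(_.value()).toSeq

if (paths.nonEmpty) {
  // 2 + 3. Read the Parquet files from RDR (S3) and process them as a regular Spark batch
  val data = spark.read.parquet(paths: _*)
  process(data)

  // Commit the offsets only once processing has finished, as with any Kafka consumer
  consumer.commitSync()
}
consumer.close()
```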
How do we use the new RDR “streaming” infrastructure?
1. Read the files’ paths (from the dedicated Kafka topic)
2. Read the RDR files (from S3)
3. Write files
4. Load data into the OLAP store
Did we solve the aforementioned problems?
● EMR clusters are now transient - no more idle clusters
Application type | Day 1 | Day 2 | Day 3
Old Spark Streaming app | $1,007.68 | $1,007.68 | $1,007.68
“Streaming” over RDR app | $150.08 | $198.73 | $174.68
Did we solve the aforementioned problems? (cont.)
● No more extreme load of Kafka brokers’ disks
○ We still read old messages from Kafka, but now we only read
about 1K messages per hour (rather than ~300M)
● The new infra doesn’t depend on the integration of Spark Streaming with Kafka
○ No more weird exceptions...
● We can run the Spark batch applications as (in)frequently as we’d like
Summary
● Initially replaced standalone Java with Spark & Scala
○ Still faced CSV-related issues
● Introduced Spark Streaming & Kafka for “stateless” use-cases
○ Quickly needed to handle stateful use-cases as well
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Required us to manage the state on our own
● Moved to Structured Streaming (for all use-cases)
○ Cons were mostly related to maturity
Summary (cont.)
● Went back to Spark Streaming (with Druid as OLAP)
○ Performance penalty in Kafka for long micro-batches
○ Under-utilized Spark clusters
○ Etc.
● Introduced “streaming” over our Data Lake
○ Eliminated Kafka performance penalty
○ Spark clusters are much better utilized = $$$ saved
○ And more...
Want to know more?
● Women in Big Data
○ A world-wide program that aims :
■ To inspire, connect, grow, and champion success of women in Big Data.
■ To grow women’s representation in the Big Data field to more than 25% by 2020
○ Visit the website (https://www.womeninbigdata.org/)
● Counting Unique Users in Real-Time: Here’s a Challenge for You!
○ Presented yesterday, http://tinyurl.com/yxjc72af
● NMC Tech Blog - https://medium.com/nmc-techblog
QUESTIONS
https://www.linkedin.com/in/itaiy/
THANK YOU
Structured Streaming - additional slides
Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Data stream as an unbounded table:
New data in the data stream = new rows appended to the unbounded table
Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Structured Streaming - WordCount example
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
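For reference, the WordCount example from the linked guide is roughly the following (socket source, console sink, Complete output mode):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

// Create a DataFrame representing the stream of input lines from localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate a running word count
val wordCounts = words.groupBy("value").count()

// Start the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```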
Structured Streaming - basic terms
● Input sources :
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes :
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table
● Output sinks :
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes
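A small sketch tying these terms together: a Kafka input source, a Parquet file sink using the default Append mode, and a console sink (for debugging) with an aggregation in Update mode; broker, topic and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("sources-and-sinks-sketch").getOrCreate()

// Kafka input source (placeholder broker/topic)
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// File (Parquet) sink: non-aggregation query, Append mode (the default)
val toFiles = input.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/output/")                      // placeholder
  .option("checkpointLocation", "s3://my-bucket/checkpoints/a/") // placeholder
  .start()

// Console sink (for debugging) with an aggregation, Update output mode
val counts = input.groupBy(col("key")).count()
val toConsole = counts.writeStream
  .format("console")
  .outputMode("update")
  .start()

spark.streams.awaitAnyTermination()
```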
Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means :
○ Trackable sources (i.e offsets)
○ Checkpointing
○ Idempotent sinks
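To illustrate the “idempotent sinks” part: if the upsert overwrites by key (rather than incrementing), a micro-batch that is replayed after a failure simply rewrites the same rows. Table and column names below are hypothetical:

```scala
// Hypothetical MySQL-style upsert, keyed on (campaign_id, window_start):
// replaying a batch overwrites the same rows instead of double-counting them.
val upsertSql =
  """INSERT INTO campaign_counts (campaign_id, window_start, impressions)
    |VALUES (?, ?, ?)
    |ON DUPLICATE KEY UPDATE impressions = VALUES(impressions)""".stripMargin
```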
Monitoring
Structured Streaming in production
So we started moving to Structured Streaming
Use case | Previous architecture | Old flow | New architecture | New flow
Existing Spark app | Periodic Spark batch job | Read Parquet from S3 -> Transform -> Write Parquet to S3 | Stateless Structured Streaming | Read from Kafka -> Transform -> Write Parquet to S3
Existing Java app | Periodic standalone Java process (“manual” scaling) | Read CSV -> Transform and aggregate -> Write to RDBMS | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
New app | N/A | N/A | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
Editor's Notes
  • #2: Thank you for coming to hear about our different use-cases of streaming with Spark and Kafka. I will try to make it interesting and valuable for you.
  • #3: Questions - at the end of the session
  • #6: Nielsen marketing cloud or NMC in short A group inside Nielsen, Born from exelate company that was acquired by Nielsen on March 2015 Nielsen is a data company and so are we and we had strong business relationship until at some point they decided to go for it and acquired exelate Data company meaning Buying and onboarding data into NMC from data providers, customers and Nielsen data We have huge high quality dataset enrich the data using machine learning models in order to create more relevant quality insights categorize and sell according to a need Helping brands to take intelligence business decisions E.g. Targeting in the digital marketing world Meaning help fit ads to viewers For example street sign can fit to a very small % of people who see it vs Online ads that can fit the profile of the individual that sees it More interesting to the user More chances he will click the ad Better ROI for the marketer
  • #7: What are the questions we try to answer in NMC that help our customers to take business decisions ? A lot of questions but to lead to what druid is coming to solve Translating from human problem to technical problem: UU (distinct) count Simple count
  • #8: Few words on NMC data pipeline architecture: Frontend layer: Receives all the online and offline data traffic Bare metal on different data centers (3 in US, 2 in EU ,3 in APAC) near real time - high throughput/low latency challenges Backend layer Aws Cloud based process all the frontend layer outputs ETL’s - load data to data sources aggregated and raw Applications layer Also in the cloud Variety of apps above all our data sources Web - NMC data configurations (segments, audiences etc) campaign analysis , campaign management tools etc. visualized profile graphs reports
  • #9: We’ve used Clustrix (our operation DB) as both OLTP and OLAP Events are flowing from our Serving system, need to ETL the data into our data stores (DB, DWH, etc.) Events were written to CSV files Some fields had double quotes, e.g: 2014-07-17,12:55:38,2,2,0,"1619691,9995",1 Processing was done via standalone Java process Had many problems with this architecture Truncated lines in input files Can’t enforce schema Had to “manually” scale the processes
  • #11: Around 2014 the standalone Java processes were transformed into Spark batch jobs written in Scala (but in this presentation we’re going to focus on streaming). This is a simplified version of what we built (simplified it to make it clearer across the presentation) Spark A distributed, scalable engine for large-scale data processing Unified framework for batch, streaming, machine learning, etc Was gaining a lot of popularity in the Big Data community Built on RDDs (Resilient distributed dataset) A fault-tolerant collection of elements that can be operated on in parallel Scala Combines object-oriented and functional programming First-class citizen is Spark
  • #13: Kafka Open-source stream-processing platform Highly scalable Publish/Subscribe (A.K.A pub/sub) Schema enforcement - using Schema Registry and relying on Avro format Much more Originally developed by LinkedIn Graduated from Apache Incubator on late 2012 Quickly became the de facto standard in the industry Today commercial development is led by Confluent Spark Streaming A natural evolvement of our Spark batch job (unified framework – remember?) Introduced the DStream concept Continuous stream of data Represented by a continuous series of RDDs Works in micro-batches Each RDD in a DStream contains data from a certain interval (e.g 5 minutes)
  • #14: We started with Spark Streaming over Kafka (in 2015) Our Streaming apps were “stateless” (see below) and running 24/7 : Reading a batch of messages from Kafka Performing simple transformations on each message (no aggregations) Writing the output of each batch to a persistent storage (DB, S3, etc.) Stateful operations (aggregations) were performed periodically in batch either by Spark jobs ETLs in our DB/DWH
  • #16: Looking back, Spark Streaming might have been able to perform stateful operations for us, but (as far as I recall) mapWithState wasn’t available yet, and updateStateByKey had some pending issues. The way to achieve it was : Read messages from Kafka Aggregate the messages of the current micro-batch Increased micro-batch length to achieve a better aggregation ratio Combine the results of the results of the previous micro-batches (stored on the cluster’s HDFS) Write the results back to HDFS Every X batches : Update the DB with the aggregated data (some sort of UPSERT) Delete the aggregated files from HDFS UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL) For example, given t1 with columns a (the key) and b (starting from an empty table) INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2 INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
  • #17: In this specific use-case, the app was reading from a topic which had only small amounts of data Required us to manage the state on our own Error-prone E.g what if my cluster is terminated and data on HDFS is lost? Complicates the code Mixed input sources for the same app (Kafka + files) Possible performance impact Might cause the Kafka consumer to lag Obviously not the perfect way (but that’s what we had…)
  • #18: DataFrame/Dataset - rather than DStream’s RDD Catalyst Optimizer - extensible query optimizer which is “at the core of Spark SQL… designed with these key two purposes: Easily add new optimization techniques and features to Spark SQL Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.)” (see https://databricks.com/glossary/catalyst-optimizer) Other features included : Handling event-time and late data End-to-end exactly-once fault-tolerance
  • #19: Checkpoint folder is the location Spark stores : The offsets we already read from Kafka The state of the stateful operations (e.g aggregations) We’ve used S3 (via EMRFS) for checkpointing. We’ve deployed to production various use-cases using Structured Streaming : Periodic Spark batch job was converted to a stateless Structured Streaming app Periodic standalone Java app was converted to a stateful Structured Streaming app A brand new app was written as a stateful Structured Streaming app
  • #20: EMRFS consistent view - an optional feature on AWS EMR, allows clusters to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS Checkpointing to S3 wasn’t straight-forward Try using EMRFS consistent view Recommended for stateless apps For stateful apps, we encountered sporadic issues possibly related to the metadata store (i.e DynamoDB)
  • #21: Strengths : Running incremental, continuous processing End-to-end exactly-once fault-tolerance (if you implement it correctly) Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation) Massive efforts are invested in it Weaknesses : Maturity Inability to perform multiple actions on the exact same Dataset E.g http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-Avoiding-multiple-streaming-queries-tt30944.html Seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in Spark 2.4, but then you get at-least once)
  • #22: Moved many apps (mostly the ones performing UU counts and “hit” counts) to rely on Druid, which is meant for OLAP, so now : Spark Streaming app Runs on a long-lived EMR cluster (cluster is on 24/7) Performs the “in-batch” aggregation per micro-batch (before writing to S3) Writes relevant metadata to RDS (e.g S3 path) This kind of “split” (i.e persisting the Dataset/DataFrame and iterating it a few times) is impossible with Structured Streaming (where every “branch” of processing is a separate query, at least until https://issues.apache.org/jira/browse/SPARK-24565) M/R ingestion job (loads data into Druid) : Reads relevant metadata from RDS performs the final aggregation (before data is loaded into Druid) Update state in RDS (e.g which files were handled)
  • #23: Screenshot from Ganglia installed on our AWS EMR cluster running the Spark Streaming app Remember - longer micro-batches result in a better aggregation ratio Each such app runs on its own long-lived EMR cluster
  • #24: Extreme load of Kafka brokers’ disks Each micro-batch needs to read ~300M messages , Kafka can’t store it all in memory ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration Forced us to use 1 core per executor to avoid it https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well) We wish we could run it even less frequently Remember - longer micro-batches result in a better aggregation ratio
  • #25: Each Kafka topic has its own RDR loader, which stores the data in a separate bucket on S3 (partitioned by date). This means each topic has only 1 consumer (the appropriate RDR loader). RDR loaders use micro-batches of 4-6 minutes, writing about 100 files per micro-batch; each file is ~0.5GB (to allow efficient reads). Only simple transformations are applied to each message (no aggregations), hence no need for long micro-batches.
  • #26: Once the RDR loader writes the files to S3, it also writes the files’ paths to a designated topic in Kafka
  • #27: How does that work? Spark batch applications are executed every X hours On each execution : A transient EMR cluster is launched An app consumes the next Y messages from the designated Kafka topic (containing Y paths).Since each such app is consuming from Kafka is a standard way, the offsets (i.e which messages the consumer already read) are committed (and maintained) the same way as we do for any other Kafka consumer Then the app reads those Y paths from RDR and processes them Once done, the EMR cluster is terminated
  • #28: Applications are now batch rather than streaming We no longer use Spark Streaming-Kafka integration, but rather Kafka API (from the driver) to read the files’ paths from the designated Kafka topic Once we got the paths from Kafka, we use the “regular” batch method of reading files, i.e spark.read.parquet After processing has ended, offsets of the messages we read are committed (as we’d do for any Kafka consumer) We now use Airflow (the de facto standard in the industry) to schedule and monitor our batch jobs All this obviously is not meant to be used by apps that require actual real time (say milliseconds)
  • #29: EMR clusters are now transient, so the cluster is terminated as soon as the batch job has finished - no more idle clusters Cost: Spark Streaming cluster is on 24/7, so the cost is fixed With the new infra, the daily cost varies based on the amount of data we processed that day
  • #31: Initially replaced standalone Java with Spark & Scala. Solved the scale-related issues but not the CSV-related issues. Introduced Spark Streaming & Kafka for “stateless” use-cases: replaced CSV files with Kafka (the de facto standard in the industry); already had Spark batch in production (Spark as a unified framework). Tried Spark Streaming for stateful use-cases (via “local” aggregations) - not the optimal solution. Moved to Structured Streaming (for all use-cases). Pros include: enables running continuous, incremental processes; built on Spark SQL. Cons include: maturity; inability to perform multiple actions on the exact same Dataset.
  • #32: Went back to Spark Streaming Aggregations are done per micro-batch (in Spark) and daily (in Druid) Still not perfect Performance penalty in Kafka for long micro-batches Concurrency issue with Kafka 0.10 consumer in Spark Under-utilized Spark clusters Introduced “streaming” over our Data Lake Spark Streaming apps (A.K.A “RDR loaders”) write files to S3 and paths to Kafka Spark batch apps read S3 paths from Kafka (and the actual files from S3) Transient EMR clusters Airflow for scheduling and monitoring Pros : Eliminated the performance penalty we had in Kafka Spark clusters are much better utilized = $$$ saved
  • #37: “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended… You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table”
  • #39: “A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.”