Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends
Scott Haines
Senior Principal Software Engineer
Introductions
About Me
Scott Haines: Senior Principal Software Engineer @newfront
▪ I work at Twilio
▪ Over 10 years working on Streaming Architectures
▪ Helped bring Streaming-First Spark Architecture to Voice & Voice Insights
▪ Lead Spark Office Hours @ Twilio
▪ Love Distributed Systems
Agenda
▪ The Big Picture: what the architecture looks like
▪ Protocol Buffers: what they are and why they rule
▪ GRPC / Protocol Streams: versioned data lineage as a service
▪ How this fits into Spark: Structured Streaming with Protobuf support
The Big Picture
Streaming Microservice Architecture
[Diagram, steps numbered 1 to 9: a GRPC client connected over HTTP/2 to three GRPC servers, two Kafka brokers, a Spark application, and HDFS / S3 storage.]
Streaming Microservice Architecture
[Diagram components: three Kafka topics, four Spark applications, two data tables, and one GRPC server.]
Protocol Buffers aka protobuf
Protocol Buffers
Why use them?
▪ Strict Types
  ▪ Enforce structure at compile time
  ▪ Similar to StructType in Apache Spark
  ▪ Interoperable with Spark via ExpressionEncoding extension
▪ Versioning API / Data Pipeline
  ▪ Compiled protobuf (*.proto) can be released like normal code
▪ Interoperable
  ▪ Pick your favorite programming language, then compile and release
  ▪ Supports Java, Scala, C++, Go, Obj-C, Node-JS, Python and more
Protocol Buffers
Why use them?
▪ Code Gen
  ▪ Automatically generate Builder classes
  ▪ Being lazy is okay!
▪ Optimized
  ▪ Messages are optimized and ship with their own Serialization/Deserialization mechanics (SerDe)
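To make the "optimized" point concrete, here is a minimal sketch of how protobuf lays out a single unsigned integer field on the wire: a key byte built from the field number and wire type, followed by a base-128 varint. It is written by hand with the Python standard library purely for illustration; the helper names (encode_varint, encode_uint_field, decode_varint) are hypothetical, and real projects would rely on the generated message classes instead.

```python
from typing import Tuple

WIRE_TYPE_VARINT = 0  # protobuf wire type used for int32/int64/uint/bool/enum


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field as key (field number + wire type) then value."""
    key = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(key) + encode_varint(value)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


# Field number 1 holding the value 150 takes three bytes on the wire.
payload = encode_uint_field(1, 150)
print(payload.hex())  # 089601
```

The same value as a JSON pair such as {"id": 150} would take at least eleven bytes, which is the kind of saving the "Compact Binary Exchange Format" claim refers to.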
GRPC and Protocol Streams
gRPC
What is it?
▪ High Performance
  ▪ Compact Binary Exchange Format
  ▪ Make API calls to the server as if they were local to the client
▪ Cross Language / Cross Platform
  ▪ Autogenerate API definitions for idiomatic client and server; just implement the interfaces
▪ Bi-Directional Streaming
  ▪ Pluggable support for streaming with HTTP/2 transport
[Diagram: a GRPC client connected over HTTP/2 to three GRPC servers.]
GRPC Example: AdTracking
GRPC
How it works?
▪ Define Messages
  ▪ What kind of data are you sending?
    ▪ Example: Click Tracking / Impression Tracking
  ▪ What is necessary for the public interface?
    ▪ Example: AdImpression and Response
GRPC
How it works?
▪ Service Definition
  ▪ Compile your rpc definition to generate Service Interfaces
  ▪ Uses the same protobuf definition (service.proto) as your Client/Server request and response objects
  ▪ Can be used to create a binding Service Contract within your organization or publicly
GRPC
How it works?
▪ Implement the Service
  ▪ Compilation of the Service auto-generates your interfaces
  ▪ Just implement the service contracts
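The "generate the interface, then implement the contract" workflow can be sketched without the gRPC toolchain. In the stand-in below, an abstract base class plays the role of the generated service interface and two frozen dataclasses play the role of the compiled messages. AdImpression and Response are the message names given in the slides; their fields, the ClickTrackServicer name, and the ad_track method are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class AdImpression:
    """Stand-in for a compiled request message (fields are hypothetical)."""
    ad_id: str
    user_id: str


@dataclass(frozen=True)
class Response:
    """Stand-in for a compiled response message (field is hypothetical)."""
    accepted: bool


class ClickTrackServicer(ABC):
    """Plays the role of the interface generated from the service definition."""

    @abstractmethod
    def ad_track(self, request: AdImpression) -> Response:
        """Every implementation must accept an AdImpression and return a Response."""


class ClickTrackService(ClickTrackServicer):
    """The only hand-written part: the body of the contract."""

    def ad_track(self, request: AdImpression) -> Response:
        # Accept any impression that carries a non-empty ad id.
        return Response(accepted=bool(request.ad_id))


service = ClickTrackService()
reply = service.ad_track(AdImpression(ad_id="ad-1", user_id="user-1"))
print(reply)  # Response(accepted=True)
```

Because the base class is abstract, a service that forgets to implement ad_track cannot be instantiated, which loosely mirrors the guarantee the generated interfaces provide.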
GRPC
How it works?
▪ Protocol Streams
  ▪ Messages (protobuf) are emitted to Kafka topic(s) from the Server Layer
  ▪ Protocol Streams are now available from the Kafka Topics bound to a given Service / Collection of Messages
  ▪ Sets up Spark for the Hand-Off
GRPC
System Architecture
[Diagram: a GRPC client connected over HTTP/2 to three GRPC servers and two Kafka brokers.]
▪ Client: service.adTrack(trackedAd)
▪ Server: ClickTrackService.adTrack(trackedAd)
▪ Topic: ads.click.stream
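The hand-off from the server layer to a topic might look roughly like the sketch below. It runs entirely in memory: InMemoryBroker is a hypothetical stand-in for a Kafka producer, and JSON bytes stand in for the serialized protobuf message so that the example needs no third-party packages. Only the topic name and the adTrack call come from the slides.

```python
import json
from collections import defaultdict
from typing import DefaultDict, Dict, List

TOPIC = "ads.click.stream"  # topic name from the slide


class InMemoryBroker:
    """Hypothetical stand-in for a Kafka producer: one append-only log per topic."""

    def __init__(self) -> None:
        self.topics: DefaultDict[str, List[bytes]] = defaultdict(list)

    def send(self, topic: str, value: bytes) -> int:
        """Append the record and return its offset within the topic."""
        self.topics[topic].append(value)
        return len(self.topics[topic]) - 1


class ClickTrackService:
    """Server-side handler that forwards every tracked ad to the topic."""

    def __init__(self, broker: InMemoryBroker) -> None:
        self.broker = broker

    def ad_track(self, tracked_ad: Dict[str, str]) -> Dict[str, object]:
        # A real server would publish the protobuf bytes of the message;
        # JSON bytes are used here only to keep the sketch dependency-free.
        payload = json.dumps(tracked_ad, sort_keys=True).encode("utf-8")
        offset = self.broker.send(TOPIC, payload)
        return {"accepted": True, "offset": offset}


broker = InMemoryBroker()
service = ClickTrackService(broker)
first = service.ad_track({"ad_id": "ad-1"})
second = service.ad_track({"ad_id": "ad-2"})
print(first, second, len(broker.topics[TOPIC]))
```

Each topic then holds a stream of one message type, which is what lets a downstream Spark job treat the topic as a typed source.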
Structuring Protocol Streams: with Structured Streaming and protobuf
Structured Streaming with Protobuf
From Protocol Buffer to StructType through ExpressionEncoders
▪ Expression Encoding
  ▪ Natively interoperate with Protobuf in Apache Spark
  ▪ Protobuf to case class conversion from scalapb
  ▪ Product encoding comes for free via import sparkSession.implicits._
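The idea of deriving a table schema from a typed message can be shown by analogy. The sketch below is not Spark or scalapb; it uses a plain Python dataclass as the typed record and walks its fields to produce (column name, SQL type name) pairs, loosely the way a product encoder derives a StructType from a case class. The AdImpression fields, the type mapping, and derive_schema are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class AdImpression:
    """Hypothetical typed record standing in for a generated case class."""
    ad_id: str
    user_id: str
    timestamp_ms: int


# Assumed mapping from Python field types to Spark SQL type names.
SQL_TYPE_NAMES = {str: "string", int: "bigint", float: "double", bool: "boolean"}


def derive_schema(record_class: type) -> List[Tuple[str, str]]:
    """Return (column name, SQL type name) pairs in field declaration order."""
    return [(f.name, SQL_TYPE_NAMES[f.type]) for f in fields(record_class)]


schema = derive_schema(AdImpression)
for column, sql_type in schema:
    print(f"{column}: {sql_type}")
```

The point of the analogy is that the schema is never written by hand: it follows from the compiled type, so a change to the message definition changes the schema in the same release.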
Structured Streaming with Protobuf
From Protocol Buffer to StructType through ExpressionEncoders
▪ Native is Better
  ▪ Strict native Kafka to DataFrame conversion with no need for transformation to intermediary types
  ▪ Mutations and Joins can be done across the DataFrame or Dataset APIs
  ▪ Create Real-Time Data Pipelines, Machine Learning Pipelines and more
  ▪ Rest at night knowing the pipelines are safe!
Structured Streaming with Protobuf
From Protocol Buffer to StructType through ExpressionEncoders
▪ Strict Data Writer
  ▪ Compiled / Versioned Protobuf can even be used to strictly enforce the format of your Writers
  ▪ Use Protobuf to define the StructType that can be used in your conversions to Parquet (the schema must abide by Parquet nesting rules)
  ▪ Declarative Input / Output means that Streaming Applications don’t go down due to incompatible Data Streams
  ▪ Can also be used with Delta so that the version of the schema lines up with the compiled Protobuf
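One way to picture a strict writer is a gate that checks every record against the declared schema before anything is written, setting aside records that do not conform instead of failing the whole stream. The sketch below is a plain Python illustration of that idea, not the Spark or Delta writer API; SCHEMA, conforms, and split_by_schema are hypothetical names.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical declared schema: column name -> required Python type.
SCHEMA: Dict[str, type] = {"ad_id": str, "user_id": str, "timestamp_ms": int}


def conforms(record: Record, schema: Dict[str, type]) -> bool:
    """True when the record has exactly the declared columns with exact types."""
    if record.keys() != schema.keys():
        return False
    return all(type(record[name]) is expected for name, expected in schema.items())


def split_by_schema(
    records: List[Record], schema: Dict[str, type]
) -> Tuple[List[Record], List[Record]]:
    """Partition records into (writable, rejected) without raising."""
    writable: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        (writable if conforms(record, schema) else rejected).append(record)
    return writable, rejected


incoming = [
    {"ad_id": "ad-1", "user_id": "user-1", "timestamp_ms": 1},
    {"ad_id": "ad-2", "user_id": "user-2", "timestamp_ms": "late"},  # wrong type
    {"ad_id": "ad-3", "user_id": "user-3"},  # missing column
]
writable, rejected = split_by_schema(incoming, SCHEMA)
print(len(writable), len(rejected))  # 1 2
```

Rejected records could be routed to a separate location for inspection; that choice is outside what the slides describe.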
Structured Streaming with Protobuf
Example: Streaming Transformation Pipeline
▪ Real World Use Case
  ▪ Close of Books Data Lineage Job
  ▪ Uses End to End Protobuf
  ▪ Enables teams to move quickly with guarantees regarding the Data being published and at what Frequency
  ▪ Can be emitted at different speeds to different locations based on configuration
Streaming Microservice Architecture
[Diagram, steps numbered 1 to 9: a GRPC client connected over HTTP/2 to three GRPC servers, two Kafka brokers, a Spark application, and HDFS / S3 storage.]
Recap
What We Learned
Protobuf
▪ Language Agnostic Structured Data
▪ Compile Time Guarantees
▪ Lightning Fast Serialization/Deserialization
GRPC
▪ Language Agnostic Binary Services
▪ Low-Latency
▪ Compile Time Guarantees
▪ Smart Framework
Kafka
▪ Highly Available
▪ Native Connector for Spark
▪ Topic Based Binary Protobuf Store
▪ Use to Pass Records to one or more Downstream Services
Structured Streaming
▪ Handle Data Reliably
▪ Protobuf to Dataset / DataFrames is awesome
▪ Parquet / Delta play nicely as Columnar Data Exchange formats
Thanks @newfrontcreative
@newfront

