Moving Beyond Moving Bytes
Joey Frazee
Suneel Marthi
September 12, 2017
Flink Forward, Berlin, Germany
1
$WhoAreWe
Joey Frazee
 @jfrazee
Product Solutions Architect, Hortonworks
Committer on Apache NiFi, and PMC on Apache Streams
Suneel Marthi
 @suneelmarthi
Principal Software Engineer, Office of Technology, Red Hat
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache
Streams
2
Agenda
What is a Schema Registry?
Why should you Care?
What Exists Today?
Different Wire Formats
Using a Schema Registry
Using a Schema Registry across a Data pipeline
Implementation with Flink Deserialization Schemas
Demo
3
What is a Schema Registry?
4
What is a Schema?
information about a record
field names, field types, default values and aliases
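The embedded-schema hexdump later in the deck spells out such a record for tweets; reconstructed as a standalone Avro schema (field list partial, and the default on favorite_count added purely to illustrate defaults), it looks like:

```json
{
  "type": "record",
  "name": "Tweet",
  "namespace": "twitter",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "id_str", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "lang", "type": "string"},
    {"name": "favorite_count", "type": "long", "default": 0}
  ]
}
```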
5
A schema registry is:
a centralized, versioned schema repository service
that supports decentralized schema-based
serialization and deserialization
6
Why should you care?
Because real-time stream processing mandates that you know
the semantics of your data:
Interesting operations on streams
joins, projection, aggregation, filtering, streaming SQL
all require prior knowledge of the types and values of the data
Otherwise, you're just moving bytes and counting anonymous things
(you don't need something as powerful as Flink to do that)
7
And, using embedded schemas adds (unnecessary) overhead:
The schema can be larger than the data
And it introduces a copy of the schema for every message or topic
And, bundling schemas in your project is bad:
It tightly couples the project to your data
It creates practical scalability issues across system and application boundaries
8
In general, a schema registry offers or implements:
Schema database
Version control strategy
Application API for serialization and deserialization
according to the schema
API service (e.g., REST) for schema management
Way to acquire, include, or pull in binary artifacts
(e.g., SerDes) from the service
Wire format that encodes a schema identifier along
with contents in serialized objects
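The first three items above (schema database, version control, lookup API) fit in a toy, in-memory sketch; the names below are illustrative only, not any real registry's API:

```python
class ToySchemaRegistry:
    """Toy in-memory schema registry: stores versioned schemas per subject
    and hands back (schema_id, version) pairs for wire-format headers."""

    def __init__(self):
        self._by_id = {}      # schema_id -> schema text
        self._versions = {}   # subject -> list of schema_ids, oldest first
        self._next_id = 1

    def register(self, subject, schema_text):
        # A real registry would run compatibility checks here before accepting
        schema_id = self._next_id
        self._next_id += 1
        self._by_id[schema_id] = schema_text
        self._versions.setdefault(subject, []).append(schema_id)
        return schema_id, len(self._versions[subject])  # (id, version)

    def get_by_id(self, schema_id):
        return self._by_id[schema_id]

    def latest(self, subject):
        return self._versions[subject][-1]
```

Producers register (or look up) a schema once, embed only the small id header per message, and consumers resolve the id back to the full schema and cache it.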
9
What Exists Today
10
3 Options
Cask Schema Registry
A schema serving layer for Avro and Protobuf to support data preparation
and validation in Cask CDAP Wrangler
Confluent Schema Registry
An interface for storing and retrieving Avro schemas for efficient
serialization in Kafka and interop with Kafka Streams
Hortonworks Registry
Shared repository of schemas and SerDes to support Avro schema sharing,
record processing and serialization in and across applications (e.g., Apache
NiFi, Hortonworks Streaming Analytics Manager)
11
Wire Formats
Cask (N/A?)
Confluent (5-byte header)
Magic byte/protocol version (byte): byte 0
Schema id (int): bytes 1-4
Hortonworks (13-byte header)
Magic byte/protocol version (byte): byte 0
Schema id (long): bytes 1-8
Schema version (int): bytes 9-12
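Both headers are small and fixed-size, so decoding them is a couple of struct unpacks. A sketch (assuming big-endian integers, as Java's DataOutputStream writes them; verify against the registry client you actually use):

```python
import struct

def parse_confluent_header(msg: bytes):
    # Confluent: 1 magic/protocol byte + 4-byte schema id = 5-byte header
    magic, schema_id = struct.unpack(">bi", msg[:5])
    return magic, schema_id, msg[5:]

def parse_hortonworks_header(msg: bytes):
    # Hortonworks: 1 protocol byte + 8-byte schema id + 4-byte schema version = 13 bytes
    protocol, schema_id, version = struct.unpack(">bqi", msg[:13])
    return protocol, schema_id, version, msg[13:]
```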
12
Feature Comparison

|             | REST API | Schemas        | Custom SerDes | Storage           | HA                            | UI | Maven Plugin | Schema Compatibility Checking | Kafka Integration |
|-------------|----------|----------------|---------------|-------------------|-------------------------------|----|--------------|-------------------------------|-------------------|
| Cask        | Y        | Avro, Protobuf |               | Cask CDAP DataSet | ?                             | Y  | ?            |                               |                   |
| Confluent   | Y        | Avro           |               | Kafka Topic       | master/slave                  |    | Y            | Y                             | Y                 |
| Hortonworks | Y        | Avro           | Y             | JDBC, HDFS        | storage + load balancer/proxy | Y  |              |                               |                   |

13
Using a Schema Registry
14
Add a New Schema
15
Schema Entry
16
Edit Schema
17
Schema Version
18
Using a Schema Registry across
a Data pipeline
19
Example Data Pipeline
1. Request schema from schema registry service via schema id
2. Receive the associated schema
3. Serialize the message contents according to the schema, packed with the encoded schema metadata, and
publish to Kafka
4. Consume from Kafka and decode the message into its schema metadata and contents
5. Request the schema from schema registry service via schema id
6. Receive the associated schema
7. Deserialize the contents according to the schema and do cool stuff
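Steps 3-7 above can be sketched end to end, with a plain bytes value standing in for Kafka, JSON standing in for Avro, a dict standing in for the registry service, and the Confluent-style 5-byte header from the wire-format slide:

```python
import json
import struct

REGISTRY = {42: "twitter.Tweet schema"}  # stand-in for the registry service

def serialize(schema_id: int, record: dict) -> bytes:
    # Step 3: pack the header (magic byte + schema id) in front of the encoded contents
    header = struct.pack(">bi", 0, schema_id)
    return header + json.dumps(record).encode("utf-8")

def deserialize(msg: bytes):
    # Step 4: split the message into schema metadata and contents
    magic, schema_id = struct.unpack(">bi", msg[:5])
    # Steps 5-6: resolve the id against the registry (a real client caches this)
    schema = REGISTRY[schema_id]
    # Step 7: decode the contents according to the schema
    return schema, json.loads(msg[5:].decode("utf-8"))
```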
20
Apache NiFi Twitter Feed Example
21
Schema Access Strategies
Embedded schema:
Whole schema is written out with the message contents (in Avro this
corresponds to DataFileReader/Writer)
Schema metadata reference:
Schema id and other metadata are written as a header with the contents
Implicit schema:
No schema is presented and the application must know what it's expecting
or iterate through the universe of possibilities
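The cost difference between the first two strategies is easy to see in miniature (JSON again standing in for Avro; the byte counts are illustrative, not measurements):

```python
import json
import struct

schema = json.dumps({"type": "record", "name": "Tweet",
                     "fields": [{"name": "text", "type": "string"}]})
record = json.dumps({"text": "hi"}).encode("utf-8")

# Embedded schema: the full schema text rides along with every message
embedded = schema.encode("utf-8") + record

# Schema reference: only a fixed 5-byte header (magic byte + schema id)
referenced = struct.pack(">bi", 0, 42) + record

print(len(embedded), len(referenced))
```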
22
Serialization with Embedded Schema
0000000 O b j 001 002 026 a v r o . s c h e m
0000020 a 232 022 { " t y p e " : " r e c o
0000040 r d " , " n a m e " : " T w e e
0000060 t " , " n a m e s p a c e " : "
0000100 t w i t t e r " , " f i e l d s
0000120 " : [ { " n a m e " : " i d " ,
0000140 " t y p e " : " l o n g " } , {
0000160 " n a m e " : " i d _ s t r " ,
0000200 " t y p e " : " s t r i n g " }
0000220 , { " n a m e " : " t e x t " ,
0000240 " t y p e " : " s t r i n g " }
0000260 , { " n a m e " : " l a n g " ,
0000300 " t y p e " : " s t r i n g " }
0000320 , { " n a m e " : " f a v o r i
0000340 t e _ c o u n t " , " t y p e "
0000360 : " l o n g " } , { " n a m e "
0000400 ...
23
Serialization with Hortonworks Schema
Reference
0000000 001 0 0 0 0 0 0 0 001 0 0 0 001 200 200 ?
0000020 ? 214 204 ? 227 031 $ 9 0 7 3 1 2 6 6 7
0000040 5 8 8 6 8 1 7 2 8 j R T @ B T
0000060 S _ t w t : T h a n k y o u
0000100 ? 230 201 ? ? 217 h t t p s : / /
0000120 t . c o / 8 g w a z v b U J C 004
0000140 e n 0 < M o n S e p 1 1 1
0000160 8 : 3 9 : 3 1 + 0 0 0 0 2 0
0000200 1 7 032 1 5 0 5 1 5 5 1 7 1 6 6 4
0000220 226 ? 225 221 b 024 1 0 9 1 7 4 6 6 9 9
0000240 006 P M Y 022 A n a t i A m i r 002 032
0000260 M a t o k i P l a n e t 0 002 .
0000300 D o n t b l a m e m e , I '
0000320 m w e i r d ? ? n ? f ? 031 0 001
0000340 N 214 ? ? ? 002 022 3 3 5 1 4 1 6 3 8
0000360 016 B T S _ t w t 036 ? ? ? ? 203 204 ?
0000400 206 214 ? 205 204 ? 213 ? 0
24
Serialization with Confluent Schema Reference
0000000 0 0 0 0 Q 230 ? ? ? 201 ? ? 227 031 $ 9
0000020 0 7 3 1 8 0 9 5 0 1 7 9 9 6 2 8
0000040 8 v @ _ _ k i l e y @ o n l y
0000060 s i n w o r l d Y e a h w t
0000100 f ? ? ? T h a t i s s u p
0000120 e r w e i r d ? 237 230 ? 004 e n
0000140 0 < M o n S e p 1 1 1 9 :
0000160 0 1 : 0 5 + 0 0 0 0 2 0 1 7
0000200 032 1 5 0 5 1 5 6 4 6 5 6 6 4 ? ?
0000220 ? 211 216 ? 204 ? 023 $ 7 0 2 0 0 8 8 6
0000240 7 3 6 6 9 3 2 4 8 1 n c y n d i
0000260 030 c y n d a q u i l l l l 022 S a
0000300 n d y , U T 0 > a y o u n g
0000320 m o m l i v i n g i n s
0000340 u b u r b i a . 0 ? a 204 a
25
Convert Record Processor Group
26
ConvertRecord Properties
27
AvroRecordSetWriter Properties without Schema
Registry
28
AvroRecordSetWriter Properties with Schema
Registry
29
PublishKafkaRecord Properties
30
Implementation with Flink
Deserialization Schemas
31
Hortonworks Deserialization Schema
32
Confluent Deserialization Schema
33
Next Steps with Apache Flink
Higher level SerDes for:
Source/Sink
TableSource/TableSink
34
References
Apache NiFi — Records and Schema Registries — https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
Confluent Schema Registry — https://github.com/confluentinc/schema-registry
Github — https://github.com/jfrazee/schema-registry-examples
HortonWorks Schema Registry — http://github.com/hortonworks/registry
Record-Oriented Data with NiFi — https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
35
Credits
Bryan Bende — Staff Software Engineer, Hortonworks
and PMC on Apache NiFi
Bruno P. Kinoshita — PMC on Apache OpenNLP and
Apache Commons
36
Questions ???
37