Schemas Beyond The Edge
Alexei Zenin
Platform Engineer, Uken Games
August 17th, 2021
Journey
Mobile Analytics at Uken Games
Key Features of Schema Registry
Revamped Mobile Analytics Pipeline: JSON to Protobuf
Mobile Analytics at Uken Games
Collection of data to drive:
● Retention & engagement analysis (AB tests, DAU)
● Operational debugging & investigation (purchases, rewards)
● Lower-level metric analysis (HTTP, CPU, Memory)
Focus will be on the first two
Current Ingestion Pipeline Design
~200 million events per day
Schema Silos
● Each silo has its own processes and tools
● Several teams managing the same data asset definitions
● Duplication of work across the boundaries
○ Structure
○ Data types
○ Naming
Schema Drift
Pros & Cons
Pros
● Adding a new event is quick
● JSON is easy to use
● JSON is well supported across systems
Cons
● Duplicated effort across teams for data management
● Repeated schema definitions
● Manual processes to keep things in sync
● JSON retransmits schema information (field names) with every event
Schema Registry Refresher
Overview of Schema Registry
● Provides a central place to upload your schemas
● Immutable & idempotent API
● Acts as a “barcode system”: a short ID stands in for the full schema
https://bit.ly/3ikXUci
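The "immutable & idempotent" property from the overview can be illustrated with a toy in-memory registry. This is a minimal sketch, not Schema Registry's actual implementation: the class name `InMemoryRegistry`, the SHA-256 fingerprinting, the sequential ID scheme, and the sample schema strings are all assumptions made for illustration.

```python
import hashlib


class InMemoryRegistry:
    """Toy stand-in for a schema registry (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._id_by_fingerprint: dict[str, int] = {}
        self._schema_by_id: dict[int, str] = {}

    def register(self, schema_text: str) -> int:
        # Fingerprint the schema text so identical uploads map to the same entry.
        fingerprint = hashlib.sha256(schema_text.encode("utf-8")).hexdigest()
        existing = self._id_by_fingerprint.get(fingerprint)
        if existing is not None:
            return existing  # idempotent: re-registering returns the same ID
        new_id = len(self._schema_by_id) + 1
        self._id_by_fingerprint[fingerprint] = new_id
        self._schema_by_id[new_id] = schema_text  # immutable: never overwritten
        return new_id

    def lookup(self, schema_id: int) -> str:
        # The "barcode" lookup: a small ID resolves to the full schema.
        return self._schema_by_id[schema_id]


registry = InMemoryRegistry()
first = registry.register("message Purchase { string sku = 1; }")
again = registry.register("message Purchase { string sku = 1; }")
other = registry.register("message Reward { int32 amount = 1; }")
```

Because an ID never changes meaning once issued, any copy of the ID-to-schema mapping stays valid indefinitely, which is what makes caching and embedding IDs safe.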
Default Wire Format (Confluent)
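The slide's diagram is not reproduced in this text. As a rough sketch, the Confluent framing is commonly documented as one zero "magic" byte, then a 4-byte big-endian schema ID, then the serialized payload. The helper names `frame` and `unframe` are hypothetical, and the sketch leaves out the message-index list that Confluent's Protobuf serializer writes between the schema ID and the payload.

```python
import struct

MAGIC_BYTE = 0  # Confluent framing starts with a zero byte
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID


def frame(schema_id: int, payload: bytes) -> bytes:
    # ">bI" = big-endian, no padding: 1-byte magic, 4-byte unsigned schema ID.
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> tuple[int, bytes]:
    if len(message) < HEADER_SIZE:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER_SIZE:]
```

A consumer would call `unframe`, resolve the returned schema ID against the registry (or a local cache), and only then decode the remaining bytes.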
Single Primary Architecture
● Primary is elected via Kafka Group Protocol
● Each Secondary is informed of the primary’s address
● Primary is responsible for writes (e.g. registering new schemas)
● Every node can serve reads or forward write requests
Ingestion Pipeline 2.0: Protobuf
Ingestion Pipeline 2.0
Goals:
● Decrease boilerplate work
● Leverage automation
● Increase transparency
● Single source of truth for schemas
Solution:
GitOps paradigm with Protobuf & Schema Registry
JSON to Protobuf: Schema Structures
● Legacy JSON schema structure uses the concept of an envelope with various subdivisions
● Convert to Protobuf while preserving the schema structure
Protobuf Equivalent
Protobuf features:
● Can express custom types and use composition to glue schemas together in one top-level schema
● Has support for some native types like google.protobuf.Timestamp
● Ability to generate code from a schema (e.g. Java, C#)
Envelope per payload?
● Uken has between 200 and 300 event payloads per game
● One approach is to copy and paste the envelope per payload per game
Envelopes = Payloads × Games
● Downsides are duplication in schemas and generated code
● Leads to a poor developer experience
“Oneof” Branching
● Try using the oneof construct to enumerate all possible payloads at the EventPayload level
● Needs only one envelope per game, or a single mega envelope for all games
Envelopes = O(Games)
● Similar to Avro unions
● Still duplicates the envelope and EventPayload
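To make the two envelope-count formulas concrete, here is a small worked calculation. The figure of 250 payloads per game is taken from the middle of the 200-300 range quoted earlier; the game count of 4 is purely a hypothetical input, as are the function and design names.

```python
def envelope_schemas(payloads_per_game: int, games: int, design: str) -> int:
    """Count envelope definitions that must be maintained under each design."""
    if design == "envelope_per_payload":
        return payloads_per_game * games  # Envelopes = Payloads x Games
    if design == "oneof_per_game":
        return games  # Envelopes = O(Games): one envelope per game
    if design == "oneof_mega":
        return 1  # one envelope whose oneof lists every payload of every game
    raise ValueError(f"unknown design: {design}")


PAYLOADS_PER_GAME = 250  # midpoint of the 200-300 range
GAMES = 4  # hypothetical number of games, not from the talk

per_payload = envelope_schemas(PAYLOADS_PER_GAME, GAMES, "envelope_per_payload")
per_game = envelope_schemas(PAYLOADS_PER_GAME, GAMES, "oneof_per_game")
mega = envelope_schemas(PAYLOADS_PER_GAME, GAMES, "oneof_mega")
```

Under these assumed inputs the copy-and-paste design needs 1000 envelope definitions, while the oneof designs need 4 or 1. The oneof body still has to list every payload explicitly, which is the duplication the last bullet refers to.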
How to get around strict Protobuf schemas
● Protobuf requires every message type to be defined explicitly and up front
Solution:
Defer attaching the payload schema until runtime so that the envelope can be reused across games
Protobuf’s Any: Dynamic Message Container
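● Lets a field hold an embedded message of any Protobuf type
● Relies on special “packing” code
● The type_url identifies the payload by its fully qualified Protobuf message name (package plus type), not by a schema ID
● Requires the generated class to be present in the application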
https://bit.ly/2Uo6RcS
Schema Registry Compatible “Any”
● Enables attaching schemas at runtime
● Replaces type_url with schema_id
● The value field is Proto3 encoded bytes
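The bullets above can be sketched by hand-encoding such a message in the Proto3 wire format. The field numbers (1 for `schema_id`, 2 for `value`), the integer type of `schema_id`, and the function names are assumptions; the actual message definition appears only in the slide image.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if value < 0:
        raise ValueError("only non-negative values are handled in this sketch")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Return (decoded value, position after the varint)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack_any(schema_id: int, value: bytes) -> bytes:
    # Assumed field 1: schema_id as varint -> tag byte 0x08 (1 << 3 | 0)
    # Assumed field 2: value as bytes      -> tag byte 0x12 (2 << 3 | 2)
    # Note: a real Proto3 serializer would omit a zero-valued schema_id.
    return (b"\x08" + encode_varint(schema_id)
            + b"\x12" + encode_varint(len(value)) + value)


def unpack_any(data: bytes) -> tuple[int, bytes]:
    schema_id, value, pos = 0, b"", 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field, wire_type = tag >> 3, tag & 0x07
        if field == 1 and wire_type == 0:
            schema_id, pos = decode_varint(data, pos)
        elif field == 2 and wire_type == 2:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unexpected field {field} / wire type {wire_type}")
    return schema_id, value
```

The receiver reads `schema_id` first, resolves it to a schema, and then decodes `value` with that schema, so no generated class for the payload has to be compiled into the envelope's definition.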
Dynamic Envelope
● Can define one envelope for all games
● Use the AnyUken type for EventPayload
● Trade explicitness for flexibility
● Elevate schema IDs to a first-class concept within the schemas themselves (schema pointers)
Protobuf View
How do we get these schemas past the edge?
Integrating Schema Registry with Mobile clients
Problems:
● Generated Protobuf classes are not aware of schema IDs
● Client device needs access to schema IDs
“Online” Approach (traditional)
● Expose Schema Registry (SR) to mobile clients directly
● Fetch required schema IDs during app runtime
Disadvantages:
● Need to set up security for Schema Registry
● Need to scale SR to millions of clients
● Point of failure for the client
● Adds network overhead to the client
“Embedded” Approach
Embedded Tradeoffs
● No need to expose SR to millions of clients
● Leverages the immutability of schema IDs
● Requires a custom build process
● A complete snapshot of schema IDs is built into the app
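The embedded approach can be sketched as a build step that freezes a name-to-ID table and a client that reads it without any network call. This is only an illustration: the JSON asset format, the class and function names, and the message names and IDs are all hypothetical, and the real build process is not described in the slides.

```python
import json


def build_snapshot(resolved_ids: dict[str, int]) -> str:
    """Build step: freeze message-name -> schema-ID pairs into a JSON asset."""
    return json.dumps(resolved_ids, sort_keys=True)


class EmbeddedSchemaIds:
    """Client side: read-only lookup over the snapshot shipped inside the app."""

    def __init__(self, snapshot_json: str) -> None:
        self._ids: dict[str, int] = json.loads(snapshot_json)

    def id_for(self, message_name: str) -> int:
        try:
            return self._ids[message_name]
        except KeyError:
            raise KeyError(f"{message_name} is not in this build's snapshot") from None


# Hypothetical message names and IDs, resolved against the registry at build time.
snapshot = build_snapshot({"uken.Purchase": 17, "uken.Reward": 23})
ids = EmbeddedSchemaIds(snapshot)
```

Since schema IDs are immutable, a snapshot baked into an old app version never goes stale. It simply lacks IDs for schemas registered after that build.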
Serverless + Schema Registry
EventBatch wire format
● Define a generic envelope as the API contract between API Gateway & mobile client
● Accepts any resolvable schema and returns an error on bad data
● Avoids hardcoding a specific version of AnalyticsEvent
● Keeps the wire format compliant with Proto3
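A rough sketch of the "take any resolvable schema, return error if bad data" behaviour on the receiving side follows. The `PackedEvent` type, the `validate_batch` function, and the decoder table are hypothetical; in a real pipeline each decoder would be a Protobuf parser obtained by resolving the schema ID, whereas the usage below substitutes a UTF-8 decode as a stand-in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PackedEvent:
    schema_id: int  # pointer to the schema that describes `value`
    value: bytes    # encoded event bytes


def validate_batch(
    events: list[PackedEvent],
    decoders: dict[int, Callable[[bytes], object]],
) -> list[str]:
    """Return one error string per rejected event; an empty list means accept."""
    errors = []
    for index, event in enumerate(events):
        decoder = decoders.get(event.schema_id)
        if decoder is None:
            errors.append(f"event {index}: unresolvable schema ID {event.schema_id}")
            continue
        try:
            decoder(event.value)  # decoding succeeds only for well-formed data
        except ValueError as exc:
            errors.append(f"event {index}: bad data ({exc})")
    return errors


# Stand-in decoder: UTF-8 decoding plays the role of a Protobuf parser here.
decoders = {17: lambda raw: raw.decode("utf-8")}
batch = [PackedEvent(17, b"ok"), PackedEvent(99, b"x"), PackedEvent(17, b"\xff")]
errors = validate_batch(batch, decoders)
```

The batch contract itself never names a concrete event type, so new event schemas can be introduced without changing the endpoint.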
Nesting Schema IDs Visualized
Performance Comparison
● 3x improvement in latency to ingest a batch of events
● 2x reduction in size per event
● 2x increase in possible number of events buffered on device
[Charts: JSON Latency vs. Protobuf Latency]
Schema Management: GitOps paradigm
● Version control Protobuf schemas in a Git monorepo
● Use Merge Requests to collaborate on proposed changes
● Run CI/CD on commits
● Put people where judgment is needed and let automation do the rest
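One step such a CI/CD job might perform is registering a changed schema. The sketch below only builds the HTTP request for Schema Registry's `POST /subjects/{subject}/versions` endpoint and does not send it. The function name, registry URL, subject name, and schema text are hypothetical, and the slides do not say how Uken's pipeline actually performs registration.

```python
import json
import urllib.parse
import urllib.request


def build_register_request(
    registry_url: str, subject: str, proto_source: str
) -> urllib.request.Request:
    """Build (but do not send) a request that registers a Protobuf schema."""
    body = json.dumps({"schemaType": "PROTOBUF", "schema": proto_source})
    encoded_subject = urllib.parse.quote(subject, safe="")
    return urllib.request.Request(
        url=f"{registry_url}/subjects/{encoded_subject}/versions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


# Hypothetical values; a CI job would read the schema from the Git checkout.
request = build_register_request(
    "http://localhost:8081",
    "analytics-event",
    'syntax = "proto3"; message AnalyticsEvent { string name = 1; }',
)
```

Because registration is idempotent, the job can run on every commit: unchanged schemas return their existing ID, and only genuinely new schemas get a new one.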
Next Steps
● Adding a Data Dictionary
● Improving visibility with integrations to Slack
● Iterating on the RACI matrix for data management
● Migrating Spark jobs to use the new data management process
Takeaways
● GitOps allows for data management automation
● Schema Registry can empower devices outside the data center
● Schema IDs allow flexible envelope designs that operate better at scale
● Protobuf can help reduce costs severalfold
