Schemas Beyond The Edge

Alexei Zenin
Platform Engineer, Uken Games
August 17th, 2021

Journey
Mobile Analytics at Uken Games
Key Features of Schema Registry
Revamped Mobile Analytics Pipeline: JSON to Protobuf

Mobile Analytics at Uken Games
Collection of data to drive:
● Retention & engagement analysis (AB tests, DAU)
● Operational debugging & investigation (purchases, rewards)
● Lower-level metric analysis (HTTP, CPU, Memory)
Focus will be on the first two

Current Ingestion Pipeline Design
~200 million events per day

Schema Silos
● Each silo has its own processes and tools
● Several teams managing the same data asset definitions
● Duplication of work across the boundaries
○ Structure
○ Data types
○ Naming

Pros & Cons
Pros
● Adding a new event is quick
● JSON is easy to use
● JSON is well supported across systems
Cons
● Duplicated effort across teams for data management
● Repeated schema definitions
● Manual processes to keep things in sync
● JSON retransmits schema information each time
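
To illustrate the last point, here is a minimal sketch comparing one event encoded as JSON with the same values packed into a fixed binary layout. The event fields and the layout are hypothetical, not Uken's actual schema, and Python's struct module stands in for Protobuf only to show that field names travel with every JSON message.

```python
import json
import struct

# Hypothetical event; field names and values are illustrative only.
event = {"player_id": 123456, "level": 42, "session_seconds": 318.5}

# JSON repeats every field name in every message.
json_bytes = json.dumps(event).encode("utf-8")

# A schema-aware binary encoding sends only the values. The layout
# (unsigned 64-bit, unsigned 32-bit, 64-bit float) lives in the schema.
LAYOUT = "<QId"
binary_bytes = struct.pack(
    LAYOUT, event["player_id"], event["level"], event["session_seconds"]
)


def decode(data: bytes) -> dict:
    """Rebuild the event from values alone, using the shared layout."""
    player_id, level, session_seconds = struct.unpack(LAYOUT, data)
    return {
        "player_id": player_id,
        "level": level,
        "session_seconds": session_seconds,
    }


print(len(json_bytes), len(binary_bytes))
```

The receiver can only decode the binary form if it knows the layout, which is the gap a schema registry is meant to fill.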

Overview of Schema Registry
● Provides a central place to upload your schemas
● Immutable & idempotent API
● Acts as a “barcode system”
https://bit.ly/3ikXUci
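
The immutable and idempotent properties above can be sketched with a toy in-memory registry. This is an illustration only, not the Confluent implementation; the class name, method names, and the use of a SHA-256 fingerprint are all assumptions made for the example.

```python
import hashlib


class InMemoryRegistry:
    """Toy stand-in for a schema registry (hypothetical, for illustration)."""

    def __init__(self):
        self._id_by_fingerprint = {}
        self._schema_by_id = {}

    def register(self, schema_text: str) -> int:
        # Idempotent: registering the same schema text again returns the same ID.
        fingerprint = hashlib.sha256(schema_text.encode("utf-8")).hexdigest()
        if fingerprint in self._id_by_fingerprint:
            return self._id_by_fingerprint[fingerprint]
        schema_id = len(self._schema_by_id) + 1
        self._id_by_fingerprint[fingerprint] = schema_id
        self._schema_by_id[schema_id] = schema_text
        return schema_id

    def get(self, schema_id: int) -> str:
        # Immutable: an ID, once issued, always resolves to the same schema.
        return self._schema_by_id[schema_id]
```

The small integer ID plays the role of the barcode: producers attach it to data, and consumers look up the full schema from it.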

Default Wire Format (Confluent)
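
As a rough sketch of the framing: Confluent's documented wire format is a magic byte of 0, then a 4-byte big-endian schema ID, then the serialized data. For Protobuf, a list of message indexes also sits between the schema ID and the payload; the example below leaves that part out and covers only the common 5-byte header. The function names are hypothetical.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # first byte of every framed message


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a payload with the magic byte and a 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]
```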

Single Primary Architecture
● Primary is elected via the Kafka group protocol
● Each secondary is informed of the primary’s address
● Primary is responsible for writes (e.g. registering new schemas)
● Every node can serve reads or forward write requests
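
The read and write paths above can be modelled with a toy simulation. Election is not modelled (the primary is simply assigned), the shared store is a plain dict, and every name here is hypothetical; it is a sketch of the routing idea, not of Schema Registry internals.

```python
class RegistryNode:
    """Toy model of one registry node (hypothetical names)."""

    def __init__(self, name, store):
        self.name = name
        self.store = store    # shared schema store (a plain dict here)
        self.primary = None   # set once the election result is known

    def read(self, schema_id):
        # Any node can answer reads from the shared store.
        return self.store[schema_id]

    def register(self, schema_text):
        if self.primary is None:
            raise RuntimeError("no primary elected yet")
        # Secondaries forward writes; only the primary performs them.
        if self.primary is not self:
            return self.primary.register(schema_text)
        for schema_id, existing in self.store.items():
            if existing == schema_text:
                return schema_id
        schema_id = len(self.store) + 1
        self.store[schema_id] = schema_text
        return schema_id
```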

Ingestion Pipeline 2.0: Protobuf

Ingestion Pipeline 2.0
Goals:
● Decrease boilerplate work
● Leverage automation
● Increase transparency
● Single source of truth for schemas
Solution:
GitOps paradigm with Protobuf & Schema Registry

JSON to Protobuf: Schema Structures
● The legacy JSON schema structure uses the concept of an envelope with various subdivisions
● Convert to Protobuf while preserving the schema structure

Protobuf Equivalent
Protobuf features:
● Can express custom types and use composition to glue schemas together into one top-level schema
● Supports well-known types such as google.protobuf.Timestamp
● Ability to generate code from a schema (e.g. Java, C#)

Envelope per payload?
● Uken has between 200 and 300 event payloads per game
● One approach is to copy and paste the envelope per payload per game
Envelopes = Payloads × Games
● Downsides are duplication in schemas and generated code
● Leads to a poor developer experience

“Oneof” Branching
● Try using the oneof construct to enumerate all possible payloads at the EventPayload level
● Only need an envelope per game or 1 mega envelope for all games
Envelopes = O(Games)
● Similar to Avro unions
● Still has duplication of envelope and EventPayload

How to get around strict Protobuf schemas
● Every message type must be defined explicitly and up front in Protobuf
Solution:
Defer attaching the payload schema until runtime so the envelope can be reused across games

Protobuf’s Any: Dynamic Message Container
● Allows you to embed any Protobuf message type
● Uses special “packing” code
● The type_url only accepts fully qualified Protobuf type names (package plus message name)
● Requires the generated class to be present in the application
https://bit.ly/2Uo6RcS

Schema Registry Compatible “Any”
● Enables attaching schemas at runtime
● Replaces type_url with schema_id
● The value field holds Proto3-encoded bytes
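
A minimal sketch of what such a container could look like on the wire, written against the Proto3 encoding rules by hand so it has no dependencies. The field numbers (1 for schema_id as a varint, 2 for value as bytes) and the function names are assumptions; the actual message definition is not reproduced here.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if n < 0:
        raise ValueError("negative values are not handled in this sketch")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack_any_uken(schema_id: int, value: bytes) -> bytes:
    # Assumed layout: field 1 = schema_id (varint), field 2 = value (bytes).
    return (b"\x08" + encode_varint(schema_id)
            + b"\x12" + encode_varint(len(value)) + value)


def unpack_any_uken(data: bytes) -> Tuple[int, bytes]:
    schema_id, value, pos = 0, b"", 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field, wire_type = tag >> 3, tag & 0x07
        if field == 1 and wire_type == 0:
            schema_id, pos = decode_varint(data, pos)
        elif field == 2 and wire_type == 2:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unexpected field {field}, wire type {wire_type}")
    return schema_id, value
```

A consumer would read schema_id first, resolve the schema, and only then decode the bytes in value.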

Dynamic Envelope
● Can define one envelope for all games
● Use the AnyUken type for EventPayload
● Trade explicitness for flexibility
● Elevate schema IDs to a first-class concept within the schemas themselves (schema pointers)

How do we get these schemas past the edge?

Integrating Schema Registry with Mobile clients
Problems:
● Generated Protobuf classes are not aware of schema IDs
● Client device needs access to schema IDs

“Online” Approach (traditional)
● Expose Schema Registry (SR) to mobile clients directly
● Fetch required schema IDs during app runtime
Disadvantages:
● Need to set up security for Schema Registry
● Need to scale SR to millions of clients
● Adds a point of failure for the client
● Adds network overhead to the client

Embedded Tradeoffs
● No need to expose SR to millions of clients
● Leverages the immutability of schema IDs
● Requires a custom build process
● A full snapshot of schema IDs is built into the app
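
One way to picture the embedded approach: a build step writes a snapshot that maps message names to schema IDs, and the app resolves IDs from that bundled snapshot with no network call. The sketch below assumes a JSON snapshot and uses hypothetical names throughout; the real build process is not described in the slides.

```python
import json


def build_snapshot(registered):
    """Build step: turn (message name, schema ID) pairs into a JSON snapshot."""
    snapshot = {}
    for message_name, schema_id in registered:
        # Each build pins exactly one schema version (one ID) per message.
        if snapshot.get(message_name, schema_id) != schema_id:
            raise ValueError(f"conflicting IDs for {message_name}")
        snapshot[message_name] = schema_id
    return json.dumps(snapshot, sort_keys=True)


class EmbeddedSchemaIds:
    """Runtime side: resolve schema IDs from the bundled snapshot."""

    def __init__(self, snapshot_json):
        self._ids = json.loads(snapshot_json)

    def schema_id_for(self, message_name):
        return self._ids[message_name]
```

Because a registered ID never changes meaning, a snapshot taken at build time stays valid for the life of that app version.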

Serverless + Schema Registry

EventBatch wire format
● Define a generic envelope as the API contract between API Gateway & mobile client
● Accepts any resolvable schema; returns an error on bad data
● Avoids hardcoding a specific version of AnalyticsEvent
● Keeps the wire format compliant with Proto3
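
The "resolve or reject" behaviour could be sketched as follows. The batch is modelled as a list of (schema_id, payload_bytes) pairs, and `decoders` is a hypothetical stand-in for schemas resolved from Schema Registry; none of these names come from the actual gateway code.

```python
def validate_batch(batch, decoders):
    """Return (ok, errors) for a batch of (schema_id, payload_bytes) pairs.

    `decoders` maps schema ID -> callable that parses bytes or raises
    ValueError. It stands in for schemas resolved from the registry.
    """
    errors = []
    for index, (schema_id, payload) in enumerate(batch):
        decoder = decoders.get(schema_id)
        if decoder is None:
            errors.append(f"event {index}: unknown schema ID {schema_id}")
            continue
        try:
            decoder(payload)
        except ValueError as exc:
            errors.append(f"event {index}: bad data ({exc})")
    return (not errors, errors)
```

Since the check is driven by schema IDs carried in the data, the gateway does not need to be redeployed when a new event schema is registered.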

Nesting Schema IDs Visualized

Performance Comparison
● 3x improvement in latency to ingest a batch of events
● 2x reduction in size per event
● 2x increase in the possible number of events buffered on device
[Charts: JSON latency vs. Protobuf latency]

Schema Management: GitOps paradigm
● Version control Protobuf schemas in a Git monorepo
● Use Merge Requests to collaborate on upcoming changes
● Run CI/CD on commits
● Put people in the right spots, let automation do the rest
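
As an example of what a CI step might automate, here is a sketch of a schema compatibility check that compares two versions of a message. The rule set is deliberately simple and is an assumption for illustration: it flags removed field numbers and changed types, and ignores renames because field names are not part of the Protobuf wire format. The function name and input shape are hypothetical.

```python
def breaking_changes(old_fields, new_fields):
    """Compare {field_number: (name, type)} maps and list breaking changes."""
    problems = []
    for number, (old_name, old_type) in sorted(old_fields.items()):
        if number not in new_fields:
            problems.append(f"field {number} ({old_name}) was removed")
            continue
        _new_name, new_type = new_fields[number]
        if new_type != old_type:
            problems.append(
                f"field {number} changed type {old_type} -> {new_type}"
            )
    return problems
```

A pipeline could fail the Merge Request when the returned list is non-empty and register the schema only when it is empty.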

Next Steps
● Adding a Data Dictionary
● Improving visibility with integrations to Slack
● Iterating on RACI matrix for data management
● Migrating Spark jobs to use the new data management process

Takeaways
● GitOps allows for data management automation
● Schema Registry can empower devices outside the data center
● Schema IDs allow flexible envelope designs that operate better at scale
● Protobuf can help reduce costs severalfold
