Zhenzhong Xu
Real-Time Data Infrastructure @ Netflix
Evolving Keystone to
an Open Collaborative
Real-time ETL Platform
Making the world smile
Netflix's mission is to bring diverse stories from around the world to our global members, to entertain everyone and make the world smile!
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time ETL Platform
and she is not alone ...
Content Production 101
Bring joy to our members:
● Best in class experience
Give the customer the freedom, flexibility and best in class experience to enjoy the entertainment.
● Produce more exciting content
Build a technology driven Studio to empower storytelling and content production.
Zhenzhong Xu
Publish, Collect, Move & Compute event data in near real time @ Cloud Scale
[Architecture diagram. Components: Event Producer, Fronting Kafka, Router, Consumer Kafka, Stream Consumers, Batch System, Mantis, Control Plane, Self Service UI]
Keystone is ...
… a single self-contained PaaS
[Diagram: Event Processing Pipeline]
… a multi-tenant, self-service tool
… powered by a collection of building blocks:
● Stream Processing Service
● Transport / Messaging Service
● Producer API
● Consumer API
● Control Plane
● Self Service UI
Putting together ...
[Architecture diagram. Components: Event Producer, Fronting Kafka, Router, Consumer Kafka, Stream Consumers, Batch System, Mantis, Control Plane, Self Service UI]
Introducing our superheroes ...
Elliot is a data scientist who works in the Data Science Engineering organization. His main motivation is to bring the magic out of customer data and improve the customer experience.
Charlie is an application developer who works in the Studio organization. His main motivation is to leverage technology to improve the content production business.
Elliot’s work will result in a better data-powered analytics engine and drive better recommendation and personalization features for Emily.
Charlie’s work will increase content production workflow efficiency, and ultimately help Eleven enjoy higher quality content.
Data engineers and scientists prefer simplicity and faster turnaround time to be effective at their work.
Lots of common data engineering patterns are not generalized.
The A/B test we are running today takes 28 days to complete, and we don’t have a way to detect issues early so we can optimize our experiments!
Elliot
Recommendation & personalization
Which artwork to show?
A/B testing
Why Stream Processing?
● Real time Reporting
● Real time Alerting
● Faster training of ML models
● Resource Efficiency
Anatomy of a stream processing job
Stream Processing connector ecosystem in Netflix
● Hive
● Iceberg
● Kafka
● Elasticsearch
● ...more coming
// Example in Java
DataStream<YourOutputDataType> dataStream = ...
// attach Iceberg sink to the DAG
getSinkBuilder()
    .toIceberg("<sink_name>", dataStream)
    .buildAndAddSink();

// Example in Scala
val srcStream = getSourceBuilder
  .fromKafka("example-kafka-source")
  .buildScala
  .map(r => r.getPayload)
getSinkBuilder
  .toIcebergScala[util.Map[String, Object]](config, "iceSink", srcStream)
  .buildAndAddSink()
Dynamic Source / Sink selector

public class SpaasApplication extends SpaasBaseApplication {
    @Override
    public void constructJobDag(SpaasConfig config, StreamExecutionEnvironment env) {
        // build a kafka source
        SingleOutputStreamOperator<Record<Map<String, Object>>> sourceStream =
            getSourceBuilder().fromKafka("kafkasource").build();

        // dynamically pick the selected sink at deployment time
        SinkFunction<Record<Map<String, Object>>> sink = getSinkBuilder().toSelector("dynamicsink")
            .declareWith("noopsink", new NoopSink<Record<Map<String, Object>>>())
            .or("kafkasink", getSinkBuilder().toKafka("kafkasink").build())
            .build();

        sourceStream.addSink(sink);
    }
}

spaas.myTemplateNameSpace.source.names=dynamicsource
# "dynamicsource" is a dynamic selector that is currently configured to pick the kafka source
# the selector.selected setting can be overridden on the SPaaS UI at runtime, and takes effect after the job is relaunched.
spaas.myTemplateNameSpace.source.dynamicsource.type=selector
spaas.myTemplateNameSpace.source.dynamicsource.selector.selected=kafka
spaas.myTemplateNameSpace.source.dynamicsource.selector.candidates=kafka,hive
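The selector idea can be sketched in plain Java, without any SPaaS or Flink dependency. This is an illustration only, not the SPaaS implementation: the `Selector` class, its `or`/`build` methods, and the `<name>.selector.selected` key layout are hypothetical stand-ins loosely modeled on the names shown on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Consumer;

// Hypothetical stand-in for the selector idea; not the SPaaS implementation.
final class SinkSelectorSketch {

    /** Collects named candidate sinks and picks one of them by configuration. */
    static final class Selector<T> {
        private final String name;
        private final Map<String, Consumer<T>> candidates = new LinkedHashMap<>();

        Selector(String name) {
            this.name = name;
        }

        /** Registers a candidate under a name; returns this for chaining. */
        Selector<T> or(String candidateName, Consumer<T> sink) {
            candidates.put(candidateName, sink);
            return this;
        }

        /** Resolves the candidate named by "<name>.selector.selected" (assumed key layout). */
        Consumer<T> build(Properties config) {
            String key = name + ".selector.selected";
            String selected = config.getProperty(key);
            Consumer<T> sink = candidates.get(selected);
            if (sink == null) {
                throw new IllegalArgumentException(
                    "No candidate '" + selected + "' for " + key + "; known: " + candidates.keySet());
            }
            return sink;
        }
    }

    public static void main(String[] args) {
        // In a real deployment this value would come from job configuration.
        Properties config = new Properties();
        config.setProperty("dynamicsink.selector.selected", "printsink");

        Consumer<String> sink = new Selector<String>("dynamicsink")
            .or("noopsink", record -> { })
            .or("printsink", record -> System.out.println("sink got: " + record))
            .build(config);

        sink.accept("hello");
    }
}
```

The job code declares every candidate up front, so switching between them is a configuration change followed by a relaunch rather than a code change.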
Record Abstractions

// (enclosing class declaration omitted on the slide)
    @Override
    public void constructJobDag(SpaasConfig config, StreamExecutionEnvironment env) {
        ObjectMapper myObjectMapper = new ObjectMapper();
        DeserializationSchema<Person> deserializer = new PersonDeserializer();
        TypedSerializationSchema<Person, Record<Map<String, Object>>> serializer = new PersonSerializer();

        // typed Kafka source: records carry a Person payload
        SingleOutputStreamOperator<Record<Person>> sourceFunction =
            getSourceBuilder()
                .fromKafka("kafka")
                .withOutputType(Person.class)
                .withDeserializer(deserializer)
                .build();

        // typed sink that discards its input
        SinkFunction<Person> sinkFunction =
            getSinkBuilder()
                .toNull("null")
                .withType(Person.class)
                .withSerializer(serializer)
                .build();

        // unwrap the payload from each Record before sinking
        sourceFunction.map(Record::getPayload).addSink(sinkFunction);
    }
}
… and more
Data Engineering ETL patterns
● Extractor Pattern
● Join Pattern
● Enrichment Pattern
[Diagram labels: A, B, C, Map, Filter A, Filter B, Filter C]
Lots of applications are built to support studio production, but performing consistent data synchronization is hard.
Data is spread across space and time, making data search and discovery hard.
I can’t believe it takes a team weeks to find out how many scripts are written by female writers!
Charlie
[Diagram labels: Launch, Production, Creative, Forecast, Program, Deals, Post Production, Financial Reporting]
Challenges
● Heterogeneous Data Synchronization
● Operation Reporting
● Entity Search & Discovery
A case study: entity search
New event driven alternative ...
Challenge 1: Ordering Semantics
Me, 4 years old
My uncle, 2 years old
When forcing a global generation order... Can Bob and Dave be logically the same generation?
Revisit the ancestry tree
The cone shape shows the causal/partial ordering from Dave’s frame of reference.
The light cone representing the past, present, and future ...
https://en.wikipedia.org/wiki/Light_cone
In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation “happened before” is therefore only a partial ordering of the events in the system.
Figure referenced from Wikipedia: https://en.wikipedia.org/wiki/Vector_clock
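The partial ordering can be made concrete with a minimal vector-clock comparison. This is a sketch under assumptions, not code from the talk: the class name, node names, and counter values are illustrative. Two events are ordered only when one clock is less than or equal to the other in every component; otherwise they are concurrent.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal vector-clock comparison; node names and clock values are illustrative.
final class VectorClockSketch {

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    /** Compares two vector clocks; a node missing from a clock counts as 0. */
    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());

        boolean aHasSmaller = false; // some component of a is strictly less than b's
        boolean bHasSmaller = false; // some component of b is strictly less than a's
        for (String node : nodes) {
            int x = a.getOrDefault(node, 0);
            int y = b.getOrDefault(node, 0);
            if (x < y) aHasSmaller = true;
            if (y < x) bHasSmaller = true;
        }

        if (aHasSmaller && bHasSmaller) return Order.CONCURRENT; // neither happened before the other
        if (aHasSmaller) return Order.BEFORE;
        if (bHasSmaller) return Order.AFTER;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<String, Integer> e1 = Map.of("A", 1, "B", 0);
        Map<String, Integer> e2 = Map.of("A", 2, "B", 1);
        Map<String, Integer> e3 = Map.of("A", 1, "B", 2);

        System.out.println(compare(e1, e2)); // BEFORE: e1 is <= e2 in every component
        System.out.println(compare(e2, e3)); // CONCURRENT: each clock is ahead on one node
    }
}
```

The `CONCURRENT` result is the case a forced global order hides: no amount of comparing the two clocks can say which event came first.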
Multi-Master replication
Figure on the right referenced from DataStax: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
Challenge 2: Processing Contracts
Message Contract

{ // Infra layer (Chaski)
  "magicByte": 0x01, // hex
  "version": "0x12", // hex
  "attributes": {
    "id": "chaski-id-from-keystone",
    "app": "example-app",
    "host": "localhost",
    "timestamp": 1234 // long
  },
  "payload": { // Platform layer (PlatformRecord)
    "id": 1,
    "operation_ts_utc_usec": 1234, // long
    "extended_attributes": {
      "netflix/data.platform/data.mesh/cassandra": { // Platform Developer-defined Avro record.
        "schema_id": 1,
        "payload": {
          "encryption_key": "some-rsa-pub"
        }
      }
    },
    "operation": "UPDATE",
    "payload": { // User layer. Represents state of the event after an update operation occurred upstream at the source.
      "encrypted": false,
      "format": "AVRO",
      "schema_id": 1,
      "payload": {
        "row_id": 1,
        "partition_id": 1,
        "first_name": "net",
        "last_name": "flix"
      },
      "remote_payload": null
    },
    "secondary_payload": null // User layer. Represents state of the event before the operation occurred upstream at the source. This example is null to show that the source doesn't support pre-images.
  }
}
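The three layers nest like envelopes: infra wraps platform, which wraps the user payload. Below is a minimal sketch of unwrapping them, assuming the record has already been decoded into nested `Map`s keyed by the field names shown above. The class and helper names (`MessageContractSketch`, `child`, `userRow`, `hasPreImage`) are hypothetical, and the real pipeline uses Avro rather than plain maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helpers for walking the layered record; field names follow the example above.
final class MessageContractSketch {

    /** Returns the nested object stored under key, or fails with a clear message. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> child(Map<String, Object> parent, String key) {
        Object value = parent.get(key);
        if (!(value instanceof Map)) {
            throw new IllegalArgumentException("Expected an object under '" + key + "'");
        }
        return (Map<String, Object>) value;
    }

    /** Walks infra -> platform -> user layers and returns the user's row fields. */
    static Map<String, Object> userRow(Map<String, Object> infraRecord) {
        Map<String, Object> platform = child(infraRecord, "payload"); // platform layer
        Map<String, Object> user = child(platform, "payload");        // user layer (post-image)
        return child(user, "payload");                                // row fields
    }

    /** True when the source supplied a pre-image ("secondary_payload"). */
    static boolean hasPreImage(Map<String, Object> infraRecord) {
        return child(infraRecord, "payload").get("secondary_payload") != null;
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("row_id", 1, "first_name", "net", "last_name", "flix");
        Map<String, Object> user = Map.of("format", "AVRO", "schema_id", 1, "payload", row);

        Map<String, Object> platform = new HashMap<>(); // HashMap because Map.of rejects null values
        platform.put("operation", "UPDATE");
        platform.put("payload", user);
        platform.put("secondary_payload", null);

        Map<String, Object> attributes = Map.of("app", "example-app");
        Map<String, Object> infra = Map.of("attributes", attributes, "payload", platform);

        System.out.println(userRow(infra).get("first_name")); // net
        System.out.println(hasPreImage(infra));               // false
    }
}
```

Each layer only needs to understand its own fields and can treat the inner `payload` as opaque, which is what lets infra, platform, and user schemas evolve separately.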
Processor Contract
● Processor Metadata
● Configurations Management
● Processor capabilities
● Operation responsibilities
A simple use case: notify upon new deal!
Open Composable Processors
Bring it all together - rethink ETL ...
[Diagram: Keystone Routers, Flink Platform. Annotations: "Don’t waste time here!", "High Value!"]
Our first stab - an Open, Collaborative, Composable, Configurable ETL Platform...
Unleash creativity via an open collaborative data platform @ Netflix
THANKS
