The Source of Truth
2017-05-08
The New York Times
Why The New York Times Stores Every Piece of Content Ever Published in Kafka
Boerge Svingen
Director of Engineering at the New York Times, working on backend systems.
Topic: How is published content made available to the front-end applications?
[Diagram: producers of content (CMS, CMS, Archives, etc.) and consumers of content (Web, iOS, Android, etc.)]
Agenda
1. A little history
2. How things used to work
3. The source of truth
4. Implementation
5. Lessons so far
1. A little history
Kafka Summit NYC 2017 - The Source of Truth: Why the New York Times Stores Every Piece of Content Ever Published in Kafka
Source: http://www.nytimes.com/1865/04/15/news/president-lincoln-shot-assassin-deed-done-ford-s-theatre-last-night-act.html
20 years on the web
The New York Times Company Archives
2. How things used to work
[Diagram: producers of content (CMS, CMS, Archives, etc.); consumers of content (Web, iOS, Android, etc.); also shown: Search, Personalization, etc.]
A rather typical API-based architecture.
Disadvantages with this approach …
The consumers have to know about all the producers of content.
Every API tends to be different.
Every API tends to return data with a different (no) schema.
We have no efficient way of reading old content in bulk, so it’s hard to replace service stores.
Most services have to manage permanent state.
It is difficult to change the (non-existent) schema, leading to inconsistencies and duplication.
We get monoliths that try to be everything for everyone.
It’s hard to develop new products and change current ones.
3. The source of truth
The Publishing Pipeline
[Diagram: producers of content (CMS, CMS, Archives, etc.); Gateway; Kafka; Search, Personalization, Collections, etc.; GraphQL API; consumers of content (Web, iOS, Android, etc.)]
We have a schema.
The schema.
Uses proto3.
Is normalized.
[Diagram: normalized assets: Article 1, Article 2, Dateline 1, Section 1, Section 2, Credit 1, Credit 2, Image 1, Image 2, Image 3.]
The Gateway validates all assets before they go on the log.
All assets are identified by a URI:
nyt://article/186faf12-24a0-4dda-b737-018cee0b81cd
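As a small illustration of the URI form above, the sketch below splits an asset URI into its type and ID. The names `AssetRef` and `parse_asset_uri`, and the assumption that the ID is always a UUID, are illustrative and not taken from the talk.

```python
import uuid
from typing import NamedTuple
from urllib.parse import urlparse


class AssetRef(NamedTuple):
    asset_type: str       # e.g. "article"
    asset_id: uuid.UUID   # assumption: IDs are always UUIDs


def parse_asset_uri(uri: str) -> AssetRef:
    """Split an nyt:// asset URI into (type, id); raise ValueError if malformed."""
    parts = urlparse(uri)
    if parts.scheme != "nyt":
        raise ValueError(f"unexpected scheme in {uri!r}")
    # For nyt://article/<id>, urlparse puts "article" in netloc and "/<id>" in path.
    asset_type = parts.netloc
    raw_id = parts.path.lstrip("/")
    if not asset_type or not raw_id:
        raise ValueError(f"missing asset type or id in {uri!r}")
    # uuid.UUID raises ValueError when the ID is not a well-formed UUID.
    return AssetRef(asset_type, uuid.UUID(raw_id))


ref = parse_asset_uri("nyt://article/186faf12-24a0-4dda-b737-018cee0b81cd")
print(ref.asset_type, ref.asset_id)
```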
Custom linter to check for forward and backward compatibility.
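The talk does not show the linter itself. As a rough illustration of the kind of check such a tool might make, the sketch below compares two versions of a message, each modeled as a hypothetical dict from protobuf field number to (name, type). It flags a field whose type changed, a removed field number that was not reserved, and a reserved number that is reused. The data model and the rule set are assumptions; treating every type change as an error is stricter than protobuf itself requires.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model of one message version: field number -> (field name, type).
Fields = Dict[int, Tuple[str, str]]


def lint_compatibility(old: Fields, new: Fields, reserved: Set[int]) -> List[str]:
    """Return human-readable problems found when evolving `old` into `new`."""
    problems = []
    for number, (old_name, old_type) in sorted(old.items()):
        if number not in new:
            # A dropped number should be reserved so it is never reused.
            if number not in reserved:
                problems.append(f"field {number} ({old_name}) removed but not reserved")
            continue
        _, new_type = new[number]
        if new_type != old_type:
            problems.append(
                f"field {number} ({old_name}) changed type {old_type} -> {new_type}"
            )
    for number in sorted(set(new) & reserved):
        problems.append(f"field {number} reuses a reserved number")
    return problems


v1 = {1: ("headline", "string"), 2: ("word_count", "int32"), 3: ("byline", "string")}
v2 = {1: ("headline", "string"), 2: ("word_count", "string"), 4: ("summary", "string")}
for problem in lint_compatibility(v1, v2, reserved=set()):
    print(problem)
```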
GraphQL schema is automatically generated from the protobuf schema.
Monolog.
The Monolog
Single partition, totally ordered, infinite
retention.
The Source of Truth for published content.
Contains everything published since 1851.
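The talk does not show how the Monolog topic is configured. One plausible way to express "single partition, infinite retention" is with standard Kafka topic settings. In the sketch below, the topic name `monolog` and the replication factor are assumptions, while `retention.ms`, `retention.bytes` and `cleanup.policy` are regular Kafka topic-level configs, where -1 disables the time and size limits.

```python
# Assumed topic name and replication factor; the config keys are standard Kafka topic configs.
MONOLOG_TOPIC = {
    "name": "monolog",
    "partitions": 1,            # one partition gives a single total order
    "replication_factor": 3,    # assumption, not stated in the talk
    "config": {
        "cleanup.policy": "delete",  # no compaction, so old versions are not collapsed
        "retention.ms": "-1",        # no time-based deletion
        "retention.bytes": "-1",     # no size-based deletion
    },
}


def topic_config_args(topic: dict) -> list:
    """Render the settings as arguments for the kafka-topics command-line tool."""
    args = [
        "--topic", topic["name"],
        "--partitions", str(topic["partitions"]),
        "--replication-factor", str(topic["replication_factor"]),
    ]
    for key, value in sorted(topic["config"].items()):
        args += ["--config", f"{key}={value}"]
    return args


print(" ".join(topic_config_args(MONOLOG_TOPIC)))
```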
[Diagram: the normalized assets shown as a graph: Article 1, Article 2, Dateline 1, Section 1, Section 2, Credit 1, Credit 2, Image 1, Image 2, Image 3.]
Topological sort
[Diagram: assets in log order: Section 1, Dateline 1, Credit 1, Credit 2, Image 1, Image 2, Image 3, Section 2, Article 2, Article 1, Image 2 (version 2), Credit 2 (version 2).]
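An ordering like the one above can be produced with a standard topological sort, so that every asset appears on the log after the assets it references. The dependency edges below are hypothetical (the transcript preserves the asset names but not the arrows between them), and `graphlib` from the Python standard library (3.9+) stands in for whatever the real pipeline uses.

```python
from graphlib import TopologicalSorter

# Hypothetical references: asset -> the assets it depends on.
references = {
    "Article 1": {"Dateline 1", "Credit 1", "Section 1", "Image 1", "Image 2"},
    "Article 2": {"Section 2", "Image 3"},
    "Image 1": {"Credit 2"},
    "Image 2": {"Credit 2"},
}

# static_order() yields each asset only after all of its dependencies.
log_order = list(TopologicalSorter(references).static_order())

# Sanity check: every dependency sits earlier in the log than the asset using it.
position = {asset: i for i, asset in enumerate(log_order)}
for asset, deps in references.items():
    for dep in deps:
        assert position[dep] < position[asset], (dep, asset)

print(log_order)
```

Several valid orders exist for the same graph, so the printed order may differ from the one on the slide while still satisfying the same constraint.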
Denormalized log.
The denormalized log
Replicated from the monolog.
Updates the full asset every time a dependency is updated.
Makes it easier for consumers that need all the dependencies.
Partitioned by asset ID.
[Diagram: the monolog assets and the resulting denormalized entries, including Article 1 with Dateline 1, Credit 1, Section 1, Image 2 (version 2), Image 1 and Credit 2.]
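A minimal sketch of the denormalization idea described above: read the monolog in order, remember the latest version of each asset, and re-emit the full top-level asset whenever it or one of its dependencies changes. The `Asset` shape, the rule that only articles are emitted, the partition count and the CRC32-based partitioner are all assumptions for illustration; Kafka's default partitioner uses a different hash.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Asset:
    asset_id: str                # e.g. "Article1"
    version: int
    refs: Tuple[str, ...] = ()   # IDs of the assets this one depends on


def partition_for(asset_id: str, num_partitions: int) -> int:
    """Stable partition choice, so all versions of an asset share a partition."""
    return zlib.crc32(asset_id.encode("utf-8")) % num_partitions


def closure(root: str, latest: Dict[str, Asset]) -> List[Asset]:
    """Latest version of `root` plus everything it transitively references."""
    seen, stack, out = set(), [root], []
    while stack:
        asset_id = stack.pop()
        if asset_id in seen or asset_id not in latest:
            continue
        seen.add(asset_id)
        out.append(latest[asset_id])
        stack.extend(latest[asset_id].refs)
    return out


def denormalize(
    monolog: Iterable[Asset], num_partitions: int = 4
) -> Iterator[Tuple[int, str, List[Asset]]]:
    """Yield (partition, article_id, bundle) each time an article's bundle changes."""
    latest: Dict[str, Asset] = {}
    for asset in monolog:
        latest[asset.asset_id] = asset
        # Assumption: only assets whose ID starts with "Article" are top-level.
        for root in [a for a in latest if a.startswith("Article")]:
            bundle = closure(root, latest)
            if any(member.asset_id == asset.asset_id for member in bundle):
                yield partition_for(root, num_partitions), root, bundle


monolog = [
    Asset("Credit2", 1),
    Asset("Image2", 1, refs=("Credit2",)),
    Asset("Article1", 1, refs=("Image2",)),
    Asset("Image2", 2, refs=("Credit2",)),   # update to a dependency
]
for partition, article_id, bundle in denormalize(monolog):
    print(partition, article_id, [(a.asset_id, a.version) for a in bundle])
```

The sketch rescans every article for each incoming asset, which keeps it short; a real replicator would more likely keep a reverse index from each dependency to the articles that use it.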
Special-purpose logs
All replicated from the monolog.
4. Implementation
Runs on Google Cloud.
[Diagram: producers of content, Gateways, Replicators, five Kafka brokers, five ZooKeeper nodes, consumers of content. Labels: GKE (Kubernetes), Google Compute Engine, gRPC over Cloud Endpoint, Kafka Consumer over SSL.]
Passive replication.
[Diagram: producers and consumers of content across regions us-east, us-central and us-west; each region shows a Gateway and Kafka consumers, alongside the GraphQL API.]
5. Lessons so far
Managed Kafka would be very nice. (Assuming it would run on Google Cloud.)
Log-based architectures are still very new.
Google PubSub/SNS/SQS/Kinesis are not replacements for Kafka.
It’s still early days.
It will take us a while to move all services over to the new architecture.
Questions?
