Kafka Summit NYC 2017 - The Source of Truth: Why the New York Times Stores Every Piece of Content Ever Published in Kafka

The Source of Truth
2017-05-08
The New York Times
Why The New York Times
Stores Every Piece of Content
Ever Published
in Kafka

Boerge Svingen
Director of Engineering
at the New York Times,
working on backend systems.

Topic:
How is published content
made available to the front-end
applications?

[Diagram: producers of content (CMS, CMS, Archives, etc.) on one side; consumers of content (Web, iOS, Android, etc.) on the other.]

Agenda
1. A little history
2. How things used to work
3. The source of truth
4. Implementation
5. Lessons so far

Source: http://www.nytimes.com/1865/04/15/news/president-lincoln-shot-assassin-deed-done-ford-s-theatre-last-night-act.html

The New York Times Company Archives

[Diagram: producers of content (CMS, CMS, Archives, etc.), services (Search, Personalization, etc.) and consumers of content (Web, iOS, Android, etc.).]

A rather typical API-based
architecture.

Disadvantages of this approach:
The consumers have to know about all the
producers of content.

Every API tends to be different.

Every API tends to return data with a
different (no) schema.

We have no efficient way of reading old
content in bulk, so it’s hard to replace
service stores.

Most services have to manage permanent
state.

It is difficult to change the (non-existent)
schema, leading to inconsistencies and
duplication.

We get monoliths that try to be everything
for everyone.

It’s hard to develop new products and
change current ones.

Agenda
1. A little history
2. How things used to work
3. The source of truth
4. Implementation
5. Lessons so far

[Diagram: producers of content (CMS, CMS, Archives, etc.) publish through the Gateway to Kafka; services (Search, Personalization, Collections, GraphQL API, etc.) and consumers of content (Web, iOS, Android, etc.) sit on the other side.]

[Diagram: normalized assets and the references between them: Article 1, Article 2, Dateline 1, Credit 1, Credit 2, Section 1, Section 2, Image 1, Image 2, Image 3.]

The schema.
The Gateway validates all assets before
they go on the log.

The schema.
All assets are identified by a URI:
nyt://article/186faf12-24a0-4dda-b737-018cee0b81cd
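
A minimal sketch of how a consumer might take such a URI apart. The layout assumed here (asset type in the authority position, a single UUID as the path) is inferred only from the one example on the slide; the names `AssetUri` and `parse_asset_uri` are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse
from uuid import UUID


@dataclass(frozen=True)
class AssetUri:
    asset_type: str  # e.g. "article"
    asset_id: UUID


def parse_asset_uri(uri: str) -> AssetUri:
    """Split an nyt:// URI into asset type and ID (assumed layout)."""
    parts = urlparse(uri)
    if parts.scheme != "nyt" or not parts.netloc:
        raise ValueError(f"not an nyt:// asset URI: {uri!r}")
    # Assumption: the path is a single UUID segment, as in the slide example.
    # UUID() raises ValueError if the segment is not a well-formed UUID.
    return AssetUri(asset_type=parts.netloc, asset_id=UUID(parts.path.lstrip("/")))
```

Calling `parse_asset_uri("nyt://article/186faf12-24a0-4dda-b737-018cee0b81cd")` yields an `AssetUri` whose `asset_type` is `"article"`.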

The schema.
Custom linter to check for forward and
backward compatibility.
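
The talk does not show the linter itself. As an illustration only, the sketch below applies two widely used protobuf evolution rules: an existing field number must keep its type, and a removed field number must be reserved so it is never reused. The schema representation (a dict of field number to name and type) and the function name are assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

Field = Tuple[str, str]      # (field name, field type)
Schema = Dict[int, Field]    # field number -> field


def compatibility_errors(old: Schema, new: Schema,
                         reserved: FrozenSet[int] = frozenset()) -> List[str]:
    """Return human-readable problems found when evolving `old` into `new`."""
    errors = []
    for number, (name, old_type) in sorted(old.items()):
        if number in new:
            new_type = new[number][1]
            if new_type != old_type:
                errors.append(
                    f"field {number} ({name}) changed type {old_type} -> {new_type}")
        elif number not in reserved:
            errors.append(f"field {number} ({name}) removed without being reserved")
    for number in sorted(new):
        if number in reserved:
            errors.append(f"field {number} reuses a reserved number")
    return errors
```

An empty list means old readers can still decode new data and new readers can still decode old data, at least under these two rules; a real linter would check more.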

The schema.
GraphQL schema is automatically
generated from the protobuf schema.
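
How the generation works is not described in the slides. The sketch below only illustrates the general idea of mapping protobuf field types to GraphQL types; the input format (a list of name, type and repeated-flag tuples) and the scalar mapping are assumptions, not the Times' generator.

```python
from typing import List, Tuple

# Assumed mapping from a few protobuf scalar types to GraphQL built-in scalars.
SCALARS = {
    "string": "String",
    "bool": "Boolean",
    "int32": "Int",
    "float": "Float",
    "double": "Float",
}


def graphql_type(message_name: str, fields: List[Tuple[str, str, bool]]) -> str:
    """Render one message as a GraphQL object type definition."""
    lines = [f"type {message_name} {{"]
    for name, proto_type, repeated in fields:
        # Non-scalar types are assumed to be other messages with the same name.
        gql = SCALARS.get(proto_type, proto_type)
        if repeated:
            gql = f"[{gql}]"
        lines.append(f"  {name}: {gql}")
    lines.append("}")
    return "\n".join(lines)
```

For a hypothetical `Article` message with a `headline` string and repeated `images` of type `Image`, this renders a `type Article` block with `headline: String` and `images: [Image]`.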

The Monolog
Single partition, totally ordered, infinite
retention.
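
In Kafka terms, these three properties correspond to topic settings: ordering is only guaranteed within a partition, so total order means one partition, and `retention.ms` and `retention.bytes` set to `-1` disable time-based and size-based deletion. A small sketch follows; the topic name and the helper function are hypothetical, and the actual settings used at the Times are not given in the talk.

```python
# Topic settings for a single-partition, keep-forever log.
MONOLOG_TOPIC = {
    "name": "monolog",            # hypothetical topic name
    "num_partitions": 1,          # one partition gives one total order
    "config": {
        "retention.ms": "-1",     # no time-based deletion
        "retention.bytes": "-1",  # no size-based deletion
    },
}


def is_totally_ordered_forever(topic: dict) -> bool:
    """Check that a topic description has one partition and unlimited retention."""
    config = topic["config"]
    return (
        topic["num_partitions"] == 1
        and config.get("retention.ms") == "-1"
        and config.get("retention.bytes", "-1") == "-1"
    )
```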

The Monolog
The Source of Truth for published content.

The Monolog
Contains everything published since 1851.

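Topological sort
[Diagram: the asset graph (Article 1, Article 2, Dateline 1, Credit 1, Credit 2, Section 1, Section 2, Image 1, Image 2, Image 3), ordered so that each asset comes after the assets it references.]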

Section 1
Dateline 1
Credit 1
Credit 2
Image 1
Image 2
Image 3
Section 2
Article 2
Article 1
Image 2, version 2
Credit 2, version 2
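
A sketch of the ordering idea using Python's standard `graphlib` module (Python 3.9+). The reference edges below are guesses for illustration, since the slide text does not preserve which asset points to which; only the asset names come from the slides.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# asset -> assets it references (assumed edges, for illustration only)
REFERENCES: Dict[str, Set[str]] = {
    "Article 1": {"Dateline 1", "Credit 1", "Section 1", "Image 1", "Image 2"},
    "Article 2": {"Section 2", "Image 3"},
    "Image 1": {"Credit 2"},
    "Image 2": {"Credit 2"},
}


def publish_order(references: Dict[str, Set[str]]) -> List[str]:
    """Order assets so every asset appears after everything it references."""
    # TopologicalSorter treats the mapped values as predecessors,
    # so referenced assets are emitted first.
    return list(TopologicalSorter(references).static_order())
```

Publishing in this order means a consumer reading the log from the start never sees an asset whose references have not yet appeared. Later updates, such as a second version of an image, are simply appended.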

The denormalized log
Replicated from the monolog.

Updates the full asset every time a
dependency is updated.

Makes it easier for consumers that need all
the dependencies.

Partitioned by asset ID.
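
A minimal in-memory sketch of the behavior described above, not the production denormalizer: for each monolog entry it stores the asset, then re-emits the full bundle for that asset and for every asset that references it, directly or indirectly. The class, its method names and the CRC-based partitioner are assumptions for illustration, and references are assumed to be acyclic.

```python
import zlib
from typing import Dict, List, Set, Tuple


class Denormalizer:
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.assets: Dict[str, dict] = {}         # latest version of each asset
        self.referrers: Dict[str, Set[str]] = {}  # asset ID -> IDs of assets referencing it

    def partition_for(self, asset_id: str) -> int:
        # Stable hash, so every version of an asset lands in the same partition.
        return zlib.crc32(asset_id.encode("utf-8")) % self.num_partitions

    def bundle(self, asset_id: str) -> dict:
        """Return the asset with all of its dependencies embedded."""
        asset = self.assets[asset_id]
        deps = [self.bundle(d) for d in asset["refs"] if d in self.assets]
        return {"id": asset_id, "body": asset["body"], "deps": deps}

    def on_monolog_entry(self, asset_id: str, body: str,
                         refs: List[str]) -> List[Tuple[int, dict]]:
        """Store one monolog entry; return (partition, bundle) records to emit."""
        self.assets[asset_id] = {"body": body, "refs": list(refs)}
        for dep in refs:
            self.referrers.setdefault(dep, set()).add(asset_id)
        # Walk upward: the updated asset plus everything that depends on it.
        affected, seen, stack = [], set(), [asset_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            affected.append(current)
            stack.extend(sorted(self.referrers.get(current, ())))
        return [(self.partition_for(a), self.bundle(a)) for a in affected]
```

With this sketch, publishing a second version of an image emits the image and then every article that uses it, each as a complete bundle, so a consumer that needs all the dependencies never has to resolve references itself.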

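[Diagram: the denormalized log. Each entry carries an asset together with its dependencies, for example Article 1 with Dateline 1, Credit 1, Section 1, Image 1, Image 2 and Credit 2, and Article 2 with Section 2 and Image 3. After Image 2 is updated to version 2, Article 1 appears again with Image 2, version 2.]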

Special-purpose logs
All replicated from the monolog.

[Diagram: deployment. Components shown: producers of content, consumers of content, four Gateways, five Kafka brokers, five ZooKeeper nodes and four Replicators. Labels: GKE (Kubernetes), GKE (Kubernetes), Google Compute Engine, gRPC over Cloud Endpoint, Kafka Consumer over SSL.]

[Diagram: multi-region deployment across us-east, us-central and us-west. Each region contains a Gateway, Kafka consumers and other services; producers of content are on one side, consumers of content (GraphQL API, etc.) on the other.]

Managed Kafka
would be very nice.

(Assuming it would run on Google Cloud.)

Log-based architectures
are still very new.

Google Pub/Sub, SNS, SQS and Kinesis are not
replacements for Kafka.

It will take us a while to move all services over
to the new architecture.
