How Twitter built a framework to improve
infrastructure resource utilization at scale
Hello!
I am @VinuCharanya
★ Sr. Systems Engineer @Twitter
★ Proud Member of @TwitterWomen, @WomenWhoCode
Agenda
1. History & Context
2. Chargeback @Twitter
3. Kite - Service Lifecycle Manager
4. Impact & Future Work
History & Context
Thousands of Microservices
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructure utilization and efficiency at scale
Infrastructure stack (diagram):
• INFRASTRUCTURE & DATACENTER MANAGEMENT
• CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
• PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
• FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
• MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
• DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
• INFRASTRUCTURE SERVICES
  • STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
  • COMPUTE: MESOS/AURORA, HADOOP
  • DB/DW: MYSQL, VERTICA, POSTGRES
• DEPLOY (Workflows)
Number of Servers (chart): MESOS/AURORA, HADOOP, MANHATTAN: 67%
• How to get visibility into resources used by individual jobs & datasets?
• How to attribute resource consumption to teams/organization?
• How do you incentivize the right behavior to improve efficiency of resource usage?
Chargeback @Twitter
Ability to meter allocation & utilization of resources per service, per project, per engineering team to improve visibility & enable accountability
Features
• Supports diverse Infra Services
• Meters abstract resources at daily granularity
• Detailed Reports
Support diverse Infrastructure and Platform Services
1. Resource Catalog: Consistent way to inventory infrastructure resources
• Resource Fluidity: Support primitive (CPU) and abstract resources (“Tweets / second”). Extend existing resources
2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability
RESOURCE CATALOG ENTITY MODEL
PROVIDER 1:N INFRASTRUCTURE SERVICE 1:N OFFERINGS 1:N OFFER MEASURES 1:N OFFER MEASURE COST
Example values shown on the slides:
• TWITTER DC / PUBLIC CLOUD, COMPUTE, CORE-DAYS, $X
• TWITTER DC, with STORAGE (GB-RAM, FILE ACCESSES, …; $X, $Y, …) and PROCESSING CLUSTER (GB-RAM, FILE ACCESSES, …; $M, $N, …)
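The 1:N chain in the entity model can be sketched as nested records. This is only an illustration: the class names, field names, and the placeholder price are assumptions, since the slides show the shape of the model but not its implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical classes mirroring the slide's chain:
# provider -> infrastructure service -> offering -> measure -> cost.

@dataclass
class MeasureCost:
    unit_price: float  # price per unit of the measure (placeholder value below)

@dataclass
class OfferMeasure:
    label: str  # e.g. "core-days"
    costs: List[MeasureCost] = field(default_factory=list)

@dataclass
class Offering:
    label: str  # e.g. "Compute"
    measures: List[OfferMeasure] = field(default_factory=list)

@dataclass
class InfrastructureService:
    name: str  # e.g. "Aurora"
    offerings: List[Offering] = field(default_factory=list)

@dataclass
class Provider:
    name: str  # e.g. "Twitter DC"
    services: List[InfrastructureService] = field(default_factory=list)

def list_measures(provider: Provider) -> List[Tuple[str, str, str]]:
    """Flatten the catalog into (infrastructure, offering, measure) rows."""
    return [
        (service.name, offering.label, measure.label)
        for service in provider.services
        for offering in service.offerings
        for measure in offering.measures
    ]

# The slides only give "$X" as the price, so 1.0 is a stand-in.
catalog = Provider("Twitter DC", [
    InfrastructureService("Aurora", [
        Offering("Compute", [OfferMeasure("core-days", [MeasureCost(1.0)])]),
    ]),
])
rows = list_measures(catalog)
```

Keeping cost as its own 1:N level leaves room for more than one price per measure, for example as prices change over time.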
/api/1/measures
{
  "measures": [
    {
      "measure_id": 1,
      "measure_label": "core-days",
      "measure_unit_label": "per 1 core-day",
      "offering_id": 1,
      "offering_label": "Compute",
      "infrastructure_id": 1,
      "infrastructure_name": "Aurora"
    },
    {
      "measure_id": 2,
      "measure_label": "machine-days",
      "measure_unit_label": "per 1 machine-day",
      "offering_id": 2,
      "offering_label": "zone:aquila",
      "infrastructure_id": 8,
      "infrastructure_name": "Physical Infrastructure"
    },
    ...
  ]
}
So, how do you incentivize the right behavior to improve efficiency of resource usage?
Pricing is one way…
Total Cost of Ownership for Aurora
Cost of Physical Server ($X / day)
Total available Cores, broken down into:
• Operational Overhead
• Headroom
• Production Used Cores
• Non-Prod Used Cores
• Quota Buffer (Underutilized Quota)
• Container Size Buffer (Underutilized Reservation)
Callouts on the breakdown:
• Total used Cores
• Excess Cores (incl. DR, Spikes, Overallocation)
• Cores used by platform for operations & maintenance
$X core-day
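One way to read the total-cost-of-ownership slide is that the fleet's daily cost is spread over the cores tenants are billed for, so headroom, buffers, and overhead raise the price of a core-day. The sketch below assumes that reading; the formula, the function name, and every number are illustrative and not taken from the talk.

```python
def core_day_price(server_cost_per_day: float, servers: int, billable_cores: float) -> float:
    """Spread the fleet's daily cost over the cores that tenants are billed for."""
    if billable_cores <= 0:
        raise ValueError("billable_cores must be positive")
    return server_cost_per_day * servers / billable_cores

# Hypothetical fleet: 100 servers, 32 cores each, $8 per server per day.
total_cores = 100 * 32

# If every core were billable, a core-day would cost 800 / 3200.
price_if_fully_used = core_day_price(8.0, 100, total_cores)

# If only half the cores are billable (the rest is headroom, buffers,
# and platform overhead), the same fleet cost lands on fewer cores.
price_at_half_utilization = core_day_price(8.0, 100, total_cores / 2)
```

Under this assumption, under-used quota and oversized containers show up directly as a higher unit price, which is the incentive the pricing slide points at.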
Our team would be …
Metering Pipeline (ETL Job)
INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2 -> INGEST METRICS -> RAW FACT -> TRANSFORMER -> RESOLVED FACT -> DATA FIDELITY -> REPORT
The TRANSFORMER also reads the RESOURCE CATALOG and the IDENTIFIER OWNERSHIP MAPPING.
Metrics Ingestor
• Schema(client_identifier, offering_measure, volume, metadata, timestamp)
Transformer
1. Resolve Ownership
2. Cost Computation
Data Fidelity & Reporting
1. Verify Data Integrity & Fidelity
2. Alert when things don’t look the way they should
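The two transformer steps named on the slides (resolve ownership, cost computation) could look roughly like this. It is a minimal sketch: only the raw fact fields come from the slide's schema, while the class names, the lookup dictionaries, the team name, and the unit price are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class RawFact:
    # Fields follow the slide's Schema(client_identifier, offering_measure,
    # volume, metadata, timestamp).
    client_identifier: str
    offering_measure: str
    volume: float
    timestamp: str  # one row per day, e.g. "2017-10-01"
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class ResolvedFact:
    owner: Optional[str]  # None when no owner is mapped to the identifier
    offering_measure: str
    volume: float
    cost: float
    timestamp: str

def transform(fact: RawFact,
              ownership: Dict[str, str],
              unit_prices: Dict[str, float]) -> ResolvedFact:
    # Step 1: resolve ownership from the identifier ownership mapping.
    owner = ownership.get(fact.client_identifier)
    # Step 2: compute cost from the catalog's unit price for the measure.
    cost = fact.volume * unit_prices[fact.offering_measure]
    return ResolvedFact(owner, fact.offering_measure, fact.volume, cost, fact.timestamp)

# Hypothetical inputs.
ownership = {"tweetypie.prod.tweetypie": "tweetypie-team"}
unit_prices = {"core-days": 0.5}
raw = RawFact("tweetypie.prod.tweetypie", "core-days", 120.0, "2017-10-01")
resolved = transform(raw, ownership, unit_prices)
```

Returning `None` for an unmapped identifier, rather than failing, keeps unowned usage visible so it can be reported and claimed later.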
Reports
Customers and the reports they use:
Infrastructure & Platform Operators
• Overall Cluster Growth
• Allocation vs. Utilization of resources by Client/Tenant
Finance & Execs
• Budget vs. Spend per Org
• Infrastructure PnL
• Overall Efficiency & Trends
Service Owners & Developers
• Team Bill
• Per Service Allocation vs. Utilization of Resources
[Figure: INFRASTRUCTURE PNL]
[Figure: CHARGEBACK BILL FOR A TEAM]
[Figure: CHARGEBACK DRILLDOWN FOR A TEAM]
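A team bill and a per-service allocation vs. utilization view can both be derived from the same resolved rows. The sketch below assumes a simple tuple layout and made-up teams, services, and numbers; none of it comes from the actual reports.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rows: (team, service, measure, allocated, used, cost)
Row = Tuple[str, str, str, float, float, float]

facts: List[Row] = [
    ("team-a", "svc-1", "core-days", 100.0, 40.0, 50.0),
    ("team-a", "svc-2", "core-days", 20.0, 15.0, 10.0),
    ("team-b", "svc-3", "core-days", 10.0, 10.0, 5.0),
]

def team_bill(rows: List[Row]) -> Dict[str, float]:
    """Sum cost per team."""
    bill: Dict[str, float] = defaultdict(float)
    for team, _service, _measure, _allocated, _used, cost in rows:
        bill[team] += cost
    return dict(bill)

def utilization_by_service(rows: List[Row]) -> Dict[str, float]:
    """Used / allocated per service, skipping rows with no allocation."""
    return {
        service: used / allocated
        for _team, service, _measure, allocated, used, _cost in rows
        if allocated > 0
    }

bill = team_bill(facts)
utilization = utilization_by_service(facts)
```

A low ratio next to a large bill is the kind of signal a drilldown report would surface for a service owner.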
Learnings
1. Invest in data fidelity
• Trust in data is most important.
• Invest in monitoring & alerting for data inconsistencies
• Leverage this for detecting abnormal increases/decreases and notifying users
2. Accurate ownership mapping
• Static mappings go out of date quickly
• Invest in systems (e.g., Kite) for users to manage it themselves
3. Logical grouping of resources
• Identifiers were too granular and teams were too broad.
• Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it
4. Change history
• Unit prices change over time
• Orgs / Teams change over time
• Resources get added / removed
• Change history is essential for consistency, which is used for capacity planning
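Detecting an abnormal increase or decrease can start as a day-over-day comparison per service. This is a deliberately simple sketch: the 50% threshold, the function name, and the sample volumes are assumptions, and a real check would likely account for seasonality.

```python
def abnormal_change(previous: float, current: float, threshold: float = 0.5) -> bool:
    """Flag a day-over-day change whose relative size exceeds the threshold."""
    if previous == 0:
        # Any usage appearing from nothing is worth a look.
        return current != 0
    return abs(current - previous) / previous > threshold

# Hypothetical (yesterday, today) core-day volumes per service.
daily_core_days = {
    "svc-1": (100.0, 104.0),  # small drift
    "svc-2": (100.0, 260.0),  # sharp increase
    "svc-3": (80.0, 0.0),     # dropped to zero
}

alerts = sorted(
    name
    for name, (previous, current) in daily_core_days.items()
    if abnormal_change(previous, current)
)
```

The same check serves both purposes from the slide: it catches broken ingestion (a sudden drop to zero) and flags real usage spikes that owners should hear about.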
Kite @Twitter
Components (diagram):
• DASHBOARD (SINGLE PANE OF GLASS)
• SERVICE IDENTITY MANAGER
• RESOURCE PROVISIONING MANAGER
• REPORTING
• SERVICE LIFECYCLE WORKFLOWS
• METADATA
• RESOURCE QUOTA MANAGEMENT
• METERING & CHARGEBACK
• CLIENT IDENTITY
• PROVIDER APIS & ADAPTERS
• INFRASTRUCTURE & PLATFORM SERVICES
Scale:
• 10,000+ Client Identifiers
• 1,000+ Projects
• 100+ Teams
• 8 Infrastructure Services
Client Identifier Management
Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership
• Capture Org Structure: Support org structure changes and project transfer workflows to ensure up-to-date ownership of identifiers
• Unify client identifier provisioning workflow: Enables a single source of truth and reduces operator pain around provisioning and managing client identifiers.
IDENTITY ENTITY MODEL
BUSINESS OWNER 1:N TEAM 1:N PROJECT 1:N SERVICE/SYSTEM ACCOUNT 1:N <INFRA, CLIENTID>
Examples:
• Business owner INFRASTRUCTURE, team TWEETYPIE, project tweetypie, service account tweetypie, identifier <Aurora, tweetypie.prod.tweetypie>
• Business owner REVENUE, team ADS PREDICTION, project prediction, service account ads-prediction, identifier <Aurora, ads-prediction.prod.campaign-x>
Entities are time varying dimensions
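Treating entities as time-varying dimensions means ownership is looked up as of a given day, so a past bill keeps the owner it had then. A minimal sketch follows, with assumed field names, hypothetical team names, and invented dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class OwnershipRecord:
    infra: str
    client_id: str
    team: str
    valid_from: date          # inclusive
    valid_to: Optional[date]  # exclusive; None means still current

def owner_on(records: List[OwnershipRecord], infra: str, client_id: str, day: date) -> Optional[str]:
    """Return the team that owned <infra, client_id> on the given day, if any."""
    for record in records:
        same_identifier = (record.infra, record.client_id) == (infra, client_id)
        started = record.valid_from <= day
        not_ended = record.valid_to is None or day < record.valid_to
        if same_identifier and started and not_ended:
            return record.team
    return None

# Hypothetical history: the identifier moved between teams on 2017-06-01.
history = [
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "team-old", date(2016, 1, 1), date(2017, 6, 1)),
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "team-new", date(2017, 6, 1), None),
]

owner_before_transfer = owner_on(history, "Aurora", "tweetypie.prod.tweetypie", date(2017, 1, 15))
owner_after_transfer = owner_on(history, "Aurora", "tweetypie.prod.tweetypie", date(2017, 6, 1))
```

Closing the old record and opening a new one on transfer, instead of overwriting the owner, is what keeps historical reports consistent after a reorg.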
Impact
10,000+ Client Identifiers
[Figure: CLAIM OWNERSHIP]
[Figure: PROJECT DISCOVERY]
[Figure: PROJECT METADATA]
[Figure: AURORA QUOTA MANAGER]
Future Work
1. Capacity Planning
• Provide historic trends and help with forecast of capacity
2. Extend Quota Manager
• Onboard Hadoop, Storage and other systems
3. Enable project deprecation
• Detect unused resources, notify users, trigger deprecation process based on policy
@VinuCharanya

[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructure utilization and efficiency at scale

  • 1. How built a framework to improve infrastructure resource utilization at scale
  • 2. ★ Sr. Systems Engineer @Twitter ★ Proud Member of @TwitterWomen, @WomenWhoCode Iam@VinuCharanya Hello!
  • 3. 3 1 2 3 4 History & Context Chargeback @Twitter Kite - Service Lifecycle Manager Impact & Future Work Agenda
  • 9. INFRASTRUCTURE & DATACENTER MANAGEMENT CORE APPLICATION SERVICES TWEETS USERS SOCIAL GRAPH PLATFORM SERVICES SEARCH MESSAGING & QUEUES CACHE MONITORING AND ALERTING INGRESS & PROXY 
 FRAMEWORK/ LIBRARIES FINAGLE (RPC) SCALDING (Map Reduce in Scala) HERON (Streaming Compute) JVM 
 MANAGEMENT TOOLS SELF SERVE SERVICE DIRECTORY CHARGEBACK CONFIG MGMT DATA & ANALYTICS PLATFORM INTERACTIVE QUERY DATA DISCOVERY WORKFLOW MANAGEMENT INFRASTRUCTURE SERVICES MANHATTAN BLOBSTORE GRAPHSTORE TIMESERIESDB S T O R A G E MESOS/AURORA HADOOP C O M P U T E MYSQL VERTICA POSTGRES D B / D W DEPLOY
 (Workflows)
  • 11. Number of Servers MESOS/AURORA HADOOP MANHATTAN 67% How to get visibility into resources used by individual jobs & datasets?
  • 12. Number of Servers MESOS/AURORA HADOOP MANHATTAN 67% How to attribute resource consumption
 to teams/organization?
  • 13. Number of Servers MESOS/AURORA HADOOP MANHATTAN 67% How do you incentivize the right behavior to 
 improve efficiency of resource usage?
  • 15. Chargeback @Twitter Ability to meter allocation & utilization of resources
  • 16. Chargeback @Twitter Ability to meter allocation & utilization of resources per service, per project, per engineering team
  • 17. Chargeback @Twitter Ability to meter allocation & utilization of resources per service, per project, per engineering team to improve visibility & enable accountability
  • 18. Features Supports diverse Infra Services Chargeback @Twitter 18 Meters abstract resources at daily granularity Detailed Reports
  • 19. 19 Chargeback @Twitter 1. Resource Catalog: Consistent way to inventory infrastructure resources Support diverse Infrastructure and Platform Services
  • 20. 20 Chargeback @Twitter 1. Resource Catalog: Consistent way to inventory infrastructure resources • Resource Fluidity: Support primitive (CPU) and abstract resource (“Tweets / second”). Extend existing resource Support diverse Infrastructure and Platform Services
  • 21. 21 Chargeback @Twitter 1. Resource Catalog: Consistent way to inventory infrastructure resources • Resource Fluidity: Support primitive (CPU) and abstract resource (“Tweets / second”). Extend existing resource 2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability Support diverse Infrastructure and Platform Services
  • 22. OFFER MEASURE COST RESOURCE CATALOG ENTITY MODEL
  • 23. OFFER MEASURES OFFER MEASURE COST 1:N RESOURCE CATALOG ENTITY MODEL
  • 24. PROVIDER INFRASTRUCTURE SERVICE OFFERINGS OFFER MEASURES OFFER MEASURE COST 1:N 1:N 1:N 1:N RESOURCE CATALOG ENTITY MODEL
  • 25. TWITTER DC/ PUBLIC CLOUD COMPUTE CORE-DAYS $X PROVIDER INFRASTRUCTURE SERVICE OFFERINGS OFFER MEASURES OFFER MEASURE COST 1:N 1:N 1:N 1:N RESOURCE CATALOG ENTITY MODEL
  • 26. TWITTER DC/ PUBLIC CLOUD COMPUTE CORE-DAYS $X PROVIDER INFRASTRUCTURE SERVICE OFFERINGS OFFER MEASURES OFFER MEASURE COST 1:N 1:N 1:N 1:N TWITTER DC STORAGE GB- RAM PROCESSING CLUSTER FILE ACCESSES … … GB- RAM FILE ACCESSE S … … $X $Y …$M $N… … RESOURCE CATALOG ENTITY MODEL
  • 27. { measures: [ { "measure_id": 1, "measure_label": "core-days", "measure_unit_label": "per 1 core-day", "offering_id": 1, "offering_label": "Compute", "infrastructure_id": 1, "infrastructure_name": "Aurora" }, { "measure_id": 2, "measure_label": "machine-days", "measure_unit_label": "per 1 machine-day", "offering_id": 2, "offering_label": "zone:aquila", "infrastructure_id": 8, "infrastructure_name": "Physical Infrastructure", }, { /api/1/measures Chargeback @Twitter
  • 28. So, how do you incentivize the right behavior to 
 improve efficiency of resource usage?
  • 29. Pricing is one way…
  • 30. Operational Overhead Headroom Production Used Cores Non-Prod Used Cores Cost of Physical Server
 ($X / day) Total available Cores Quota Buffer
 (Underutilized Quota) Container Size Buffer
 (Underutilized Reservation) Total Cost of Ownership for Aurora $X core-day
  • 31. Operational Overhead Headroom Production Used Cores Non-Prod Used Cores Cost of Physical Server
 ($X / day) Total available Cores Quota Buffer
 (Underutilized Quota) Container Size Buffer
 (Underutilized Reservation) Total used Cores Total Cost of Ownership for Aurora $X core-day
  • 32. Operational Overhead Headroom Production Used Cores Non-Prod Used Cores Cost of Physical Server
 ($X / day) Total available Cores Quota Buffer
 (Underutilized Quota) Container Size Buffer
 (Underutilized Reservation) Total used Cores Excess Cores (incl. DR, Spikes, Overallocation)Total Cost of Ownership for Aurora $X core-day
  • 33. Operational Overhead Headroom Production Used Cores Non-Prod Used Cores Cost of Physical Server
 ($X / day) Total available Cores Quota Buffer
 (Underutilized Quota) Container Size Buffer
 (Underutilized Reservation) Total used Cores Excess Cores (incl. DR, Spikes, Overallocation) Cores used by platform
 for operations & maintenance Total Cost of Ownership for Aurora $X core-day
  • 34. Operational Overhead Headroom Production Used Cores Non-Prod Used Cores Cost of Physical Server
 ($X / day) Total available Cores Quota Buffer
 (Underutilized Quota) Container Size Buffer
 (Underutilized Reservation) Total used Cores Excess Cores (incl. DR, Spikes, Overallocation) Cores used by platform
 for operations & maintenance Total Cost of Ownership for Aurora $X core-day
  • 35. Our team would be …
  • 36. Features Supports diverse Infra/Platform Services Chargeback @Twitter 36 Meters abstract resources at daily granularity Detailed Reports
  • 37. 37 Chargeback @Twitter INFRASTRUCTURE SERVICE 1 INFRASTRUCTURE SERVICE 2 INGEST METRICS RAW FACT TRANSFORMER RESOLVED FACT RESOURCE CATALOG REPORT REPORT Metering Pipeline (ETL Job) IDENTIFIER OWNERSHIP MAPPING Metrics Ingestor DATA FIDELITY Metering Pipeline (ETL Job)
  • 38. 38 Chargeback @Twitter INFRASTRUCTURE SERVICE 1 INFRASTRUCTURE SERVICE 2 INGEST METRICS RAW FACT TRANSFORMER RESOLVED FACT RESOURCE CATALOG REPORT REPORT Metering Pipeline (ETL Job) IDENTIFIER OWNERSHIP MAPPING Schema(client_identifier, offering_measure, volume, metadata, timestamp) DATA FIDELITY Metering Pipeline (ETL Job)
  • 39. Chargeback @Twitter, Metering Pipeline (ETL Job). Highlighted: Transformer
  • 40. Chargeback @Twitter, Metering Pipeline (ETL Job). Highlighted: 1. Resolve Ownership
  • 41. Chargeback @Twitter, Metering Pipeline (ETL Job). Highlighted: 2. Cost Computation
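The two transformer steps named on slides 40 and 41 (resolve ownership, then compute cost) can be sketched as below. This is an illustration under stated assumptions: the function `transform`, the `ResolvedFact` type, the dictionary-based ownership mapping and resource catalog, and the price value are all hypothetical. The slides only show "$X" for prices and do not describe the real implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ResolvedFact:
    client_identifier: str
    offering_measure: str
    volume: float
    owner: Optional[str]  # None when no ownership mapping exists
    cost: float


def transform(
    client_identifier: str,
    offering_measure: str,
    volume: float,
    ownership: Dict[str, str],     # hypothetical: client identifier -> owning team
    unit_prices: Dict[str, float], # hypothetical: measure -> price per unit
) -> ResolvedFact:
    # Step 1: resolve ownership. Unknown identifiers stay unresolved (None)
    # so a later fidelity check can surface them instead of dropping them.
    owner = ownership.get(client_identifier)
    # Step 2: cost computation as volume times the unit price of the measure.
    # A measure missing from the catalog raises KeyError on purpose.
    cost = volume * unit_prices[offering_measure]
    return ResolvedFact(client_identifier, offering_measure, volume, owner, cost)


ownership = {"tweetypie.prod.tweetypie": "TWEETYPIE"}
unit_prices = {"aurora.core_day": 2.5}  # made-up price for illustration only
resolved = transform(
    "tweetypie.prod.tweetypie", "aurora.core_day", 10.0, ownership, unit_prices
)
```

Keeping unresolved owners as `None` rather than failing is a design choice of this sketch; it lets unattributed usage show up in reports as a gap to fix.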
  • 42. Chargeback @Twitter, Metering Pipeline (ETL Job). Highlighted: Data Fidelity & Reporting
  • 43. Chargeback @Twitter, Metering Pipeline (ETL Job). Highlighted: 1. Verify Data Integrity & Fidelity
  • 44. Chargeback @Twitter, Metering Pipeline (ETL Job). Highlighted: 2. Alert when things don’t look the way they should
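One simple way to "alert when things don't look the way they should" is to compare each identifier's metered volume against the previous day. The sketch below assumes that approach. The function name `fidelity_alerts`, the 50% threshold, and the two alert reasons are hypothetical, and the talk does not say which checks the real pipeline runs.

```python
from typing import Dict, List, Tuple


def fidelity_alerts(
    previous: Dict[str, float],
    current: Dict[str, float],
    max_relative_change: float = 0.5,  # assumed threshold, not from the talk
) -> List[Tuple[str, str]]:
    """Return (identifier, reason) pairs for day-over-day anomalies."""
    alerts: List[Tuple[str, str]] = []
    for identifier in sorted(previous):
        before = previous[identifier]
        if identifier not in current:
            # Data that silently disappears is an integrity problem.
            alerts.append((identifier, "missing"))
            continue
        after = current[identifier]
        if before == 0:
            changed = after != 0
        else:
            changed = abs(after - before) / before > max_relative_change
        if changed:
            alerts.append((identifier, "abnormal change"))
    return alerts


yesterday = {"a": 100.0, "b": 10.0, "c": 5.0}
today = {"a": 110.0, "b": 30.0}
alerts = fidelity_alerts(yesterday, today)
```

Here identifier `a` changes by 10% and is ignored, `b` triples and is flagged, and `c` is flagged because it vanished from the current day.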
  • 45. Chargeback @Twitter, Metering Pipeline (ETL Job). Full diagram: INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2, EXPORT METRICS, RAW FACT, TRANSFORMER, RESOURCE CATALOG, IDENTIFIER OWNERSHIP, RESOLVED FACT, DATA FIDELITY, REPORT
  • 47. Chargeback @Twitter, Reports by customer. Infrastructure & Platform Operators: Overall Cluster Growth; Allocation vs. Utilization of resources by Client/Tenant. Finance & Execs: Budget vs. Spend per Org; Infrastructure PnL; Overall Efficiency & Trends. Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources.
  • 53. Learnings (Chargeback @Twitter)
 1. Invest in data fidelity: Trust in data is most important. Invest in monitoring & alerting for data inconsistencies. Leverage this to detect abnormal increases/decreases and notify users.
 2. Accurate ownership mapping: Static mappings go out of date quickly. Invest in systems (e.g., Kite) that let users manage mappings themselves.
 3. Logical grouping of resources: Identifiers were too granular and teams were too broad. Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it.
 4. Change history: Unit prices change over time. Orgs / teams change over time. Resources get added / removed. Change history is essential for consistent data, which is used for capacity planning.
  • 59. DASHBOARD (SINGLE PANE OF GLASS); SERVICE IDENTITY MANAGER; RESOURCE PROVISIONING MANAGER; REPORTING; SERVICE LIFECYCLE WORKFLOWS; METADATA; RESOURCE QUOTA MANAGEMENT; METERING & CHARGEBACK; CLIENT IDENTITY PROVIDER; APIS & ADAPTERS; INFRASTRUCTURE SERVICE (three boxes); INFRASTRUCTURE & PLATFORM SERVICE
  • 60. Kite @Twitter: 10,000+ Client Identifiers; 1,000+ Projects; 100+ Teams; 8 Infrastructure Services
  • 61. Kite @Twitter, Client Identifier Management
 • Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership.
 • Capture Org Structure: Support org structure changes and project transfer workflows to ensure up-to-date ownership of identifiers.
 • Unify client identifier provisioning workflow: Enables a single source of truth and reduces operator pain around provisioning and managing client identifiers.
  • 62. IDENTITY ENTITY MODEL: <INFRA, CLIENTID>. Examples: <Aurora, tweetypie.prod.tweetypie>; <Aurora, ads-prediction.prod.campaign-x>
  • 63. IDENTITY ENTITY MODEL: SERVICE/SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>. Examples: tweetypie, <Aurora, tweetypie.prod.tweetypie>; ads-prediction, <Aurora, ads-prediction.prod.campaign-x>
  • 64. IDENTITY ENTITY MODEL: BUSINESS OWNER (1:N) TEAM (1:N) PROJECT (1:N) SERVICE/SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>. Example 1: INFRASTRUCTURE, TWEETYPIE, tweetypie, tweetypie, <Aurora, tweetypie.prod.tweetypie>. Example 2: REVENUE, ADS PREDICTION, prediction, ads-prediction, <Aurora, ads-prediction.prod.campaign-x>
  • 65. IDENTITY ENTITY MODEL (same hierarchy as the previous slide): Entities are time-varying dimensions
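Treating entities as time-varying dimensions means an ownership lookup has to be made "as of" a given day, so that past usage stays attributed to whoever owned the identifier at the time. The sketch below shows one way to model that with validity intervals. The record layout, the `owner_as_of` helper, the transfer date, and the team name `HYPOTHETICAL-TEAM` are assumptions for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class OwnershipRecord:
    infra: str                       # e.g. "Aurora"
    client_id: str                   # e.g. "tweetypie.prod.tweetypie"
    project: str
    team: str
    valid_from: date                 # inclusive
    valid_to: Optional[date] = None  # exclusive; None means still current


def owner_as_of(
    records: List[OwnershipRecord], infra: str, client_id: str, day: date
) -> Optional[OwnershipRecord]:
    """Return the record that was valid for <infra, client_id> on `day`."""
    for record in records:
        if (
            record.infra == infra
            and record.client_id == client_id
            and record.valid_from <= day
            and (record.valid_to is None or day < record.valid_to)
        ):
            return record
    return None


# Hypothetical history: the project moves to another team on 2017-06-01.
history = [
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "tweetypie",
                    "TWEETYPIE", date(2017, 1, 1), date(2017, 6, 1)),
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "tweetypie",
                    "HYPOTHETICAL-TEAM", date(2017, 6, 1)),
]
before = owner_as_of(history, "Aurora", "tweetypie.prod.tweetypie", date(2017, 3, 15))
after = owner_as_of(history, "Aurora", "tweetypie.prod.tweetypie", date(2017, 7, 1))
```

With this layout a transfer closes the old record and opens a new one instead of overwriting it, which keeps historical bills reproducible.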
  • 73. Future Work (Impact & Future Work)
 1. Capacity Planning: Provide historic trends and help with forecast of capacity.
 2. Extend Quota Manager: Onboard Hadoop, Storage and other systems.
 3. Enable project deprecation: Detect unused resources, notify users, trigger deprecation process based on policy.