Trafficshifting: Avoiding Disasters & Improving Performance at Scale
Michael Kehoe
Staff Site Reliability Engineer
LinkedIn
Overview
• Problem Statement
• Solution – How LinkedIn trafficshifts
• Datacenter shifting
• PoP steering
• Challenges of APAC region
• IPv4 vs IPv6
• Questions
$ whoami
Michael Kehoe
• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American
$ whatis SRE
Michael Kehoe
• Site Reliability Engineering
• Operations for the production application
environment
• Responsibilities include
• Architecture design
• Capacity planning
• Operations
• Tooling
• Responsibilities include DNS/CDN management & traffic infrastructure
Terminology
• PoP (Point of Presence) – Where LinkedIn terminates incoming requests
• Fabric – Datacenter with the full LinkedIn production stack deployed
• Loadtest – Stress test of a Fabric to simulate a disaster scenario
Disaster Recovery
Problem Statement
• Fail between Fabrics
• Performance of applications is degraded
• Validate disaster recovery (DR) scenario
• Expose bugs and suboptimal configurations via loadtest
• Planned maintenance
• Fail between PoPs
• Mitigate impact of a third-party provider's maintenance or failure (e.g. transport links)
• Software/ Configuration Bugs
Performance
Problem Statement
• Fabric Assignment
• Assign preferred and secondary fabric to all members based on:
• Member location
• Capacity
• PoP/CDN steering
• Use GeoDNS to steer users to the ‘best’ PoP
• Use RUM (real user monitoring) DNS to steer users to the ‘best’ CDN
United States Performance (Global)
Problem Statement
APAC Performance (APAC cities)
Problem Statement
Delta US & APAC
Problem Statement
Site Speed
Problem Statement
• Site Speed affects User Engagement
• User Engagement affects page-views & transactions
• Bottom Line: Site Speed has an impact on revenue
Problem Statement
LinkedIn’s Traffic Architecture
Solution
LinkedIn’s Traffic Architecture
Solution
Fabric shifting
Solution
• Stickyrouting
• Using a Hadoop job, we calculate a primary and secondary datacenter for each user based on location
• This data is stored in a key-value store (Espresso)
• Stickyrouting serves this information over a RESTful interface to our edge PoPs
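The assignment step above can be sketched roughly as follows. This is a minimal illustration, not LinkedIn's implementation: the fabric names, coordinates, and the `assign_fabrics` helper are hypothetical, distance stands in for "location", and a plain dict stands in for both the Hadoop job output and the Espresso store.

```python
import math

# Hypothetical fabric names and (latitude, longitude) pairs; not from the talk.
FABRICS = {
    "fabric-west": (45.5, -122.7),
    "fabric-central": (32.8, -96.8),
    "fabric-east": (38.9, -77.0),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def assign_fabrics(member_location, fabrics=FABRICS):
    """Return (primary, secondary): the two fabrics nearest to the member."""
    ranked = sorted(fabrics, key=lambda name: haversine_km(member_location, fabrics[name]))
    return ranked[0], ranked[1]


# Stand-in for the key-value store: member ID -> (primary, secondary).
# Member IDs and locations are made up for the example.
routing_table = {
    member_id: assign_fabrics(location)
    for member_id, location in {
        1001: (37.8, -122.4),  # near the US west coast
        1002: (40.7, -74.0),   # near the US east coast
    }.items()
}
```

An edge PoP would then look up `routing_table[member_id]` (in the talk, via the RESTful Stickyrouting service) to decide where to send a request.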
Fabric shifting
Solution
• Different traffic types are partitioned and controlled separately
• Logged-in vs logged-out
• CDNs
• Monitoring
• Microsites
• Logged-in users are placed into ‘buckets’
• Buckets are marked online/ offline to move site traffic
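The bucket mechanism can be illustrated with a short sketch. The bucket count, the CRC32 hash, and the `FabricBuckets` class are assumptions made for the example; the talk only states that logged-in users are placed into buckets and that buckets are marked online or offline. Tracking bucket state per fabric is also an assumption.

```python
import zlib

NUM_BUCKETS = 100  # hypothetical bucket count


def bucket_for(member_id: int) -> int:
    """Stable bucket for a member; CRC32 is used only as a deterministic hash."""
    return zlib.crc32(str(member_id).encode()) % NUM_BUCKETS


class FabricBuckets:
    """Tracks which buckets are offline in each fabric (hypothetical model)."""

    def __init__(self, fabrics):
        self.offline = {fabric: set() for fabric in fabrics}

    def set_offline_percent(self, fabric, percent):
        """Mark the first `percent` percent of buckets offline in `fabric`."""
        self.offline[fabric] = set(range(NUM_BUCKETS * percent // 100))

    def route(self, member_id, primary, secondary):
        """Serve from the primary fabric unless the member's bucket is offline there."""
        if bucket_for(member_id) in self.offline[primary]:
            return secondary
        return primary
```

Raising the offline percentage in steps moves traffic out of a fabric gradually; setting it back to 0 returns members to their primary fabric.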
Fabric shifting
Solution
• Stickyrouting – Benefits
• Ensure we serve the request as close to the user as possible
• Capacity management for datacenters
• We can assign a percentage of users to a datacenter
• Enables personal data routing (PDR)
• Only store data where we need it
Fabric shifting Automation
Solution
Fabric shifting Automation
Solution
Fabric Shifting
Solution
Fabric Shifting Loadtests
Solution
Fabric Shifting Loadtests
Solution
LinkedIn’s Traffic Architecture
Solution
LinkedIn’s PoP Distribution
Solution
LinkedIn’s PoP Architecture
Solution
• Using IPVS, each PoP announces a unicast address and a regional anycast address
• APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the ‘best’ PoP
• DNS provides users with either an anycast or a unicast address for www.linkedin.com
• US and EU members are nearly all on anycast
• APAC is all unicast
LinkedIn’s PoP DR
Solution
• We sometimes need to fail out of PoPs
• Third-party provider issues (e.g. transit links going down)
• Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on the proxy to drain unicast traffic
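One common way to fail a healthcheck on purpose is a drain flag that operators set before maintenance. The sketch below assumes a flag file; the talk does not say how LinkedIn's proxies implement this, and the path and function name are hypothetical.

```python
import os

DRAIN_FLAG = "/etc/pop/drain"  # hypothetical path, not from the talk


def healthcheck_status(drain_flag_path: str = DRAIN_FLAG) -> int:
    """Return the HTTP status a proxy health endpoint would serve.

    503 while the drain flag file exists, so upstream checks mark the
    proxy unhealthy and unicast traffic drains away; 200 otherwise.
    """
    return 503 if os.path.exists(drain_flag_path) else 200
```

Creating the flag file drains the proxy, and removing it puts the proxy back in service without a restart.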
LinkedIn’s PoP Performance
Solution
• PoP DNS Steering
• LinkedIn currently uses GeoDNS for routing
• Piloting RUM DNS
• Pick the best PoP based on network, not country
• CDN Steering
• Mix CDNs to get the best performance
• Constantly evaluate performance/ availability
• Automatically adjust CDN weighting
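Automatic CDN weighting could look something like the sketch below. The inverse-latency formula, the availability floor, and the `cdn_weights` name are assumptions chosen for illustration; the talk does not describe the actual weighting algorithm.

```python
def cdn_weights(measurements, min_availability=0.99):
    """Weight each CDN by inverse median request time.

    measurements: {cdn_name: (median_request_ms, availability)}
    CDNs below the availability floor are dropped.
    Returns {cdn_name: weight} with weights summing to 1.0.
    """
    eligible = {
        name: 1.0 / ms
        for name, (ms, availability) in measurements.items()
        if availability >= min_availability and ms > 0
    }
    if not eligible:
        # Nothing passes the floor: fall back to equal weights rather than serving nothing.
        return {name: 1.0 / len(measurements) for name in measurements}
    total = sum(eligible.values())
    return {name: score / total for name, score in eligible.items()}
```

With made-up inputs of 100 ms and 200 ms for two healthy CDNs, the faster one receives two thirds of the traffic; a third CDN at 90% availability would receive none.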
LinkedIn’s PoP Performance
Solution
[Chart: US CDN request time, 50th percentile, 24 hours]
Working around fiber cuts
APAC Challenges
• Case study: failing out of the India PoP due to fiber cuts
[Chart: Connection time for Indian members (90th percentile); labels: ASN 15802, ASN 5384]
GeoDNS Suboptimal PoPs
APAC Challenges
[Map: Singapore and Mumbai; latencies shown: 45 ms, 220 ms, 70 ms]
Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg
For ASN 15802, RTT to Singapore is 220 + 70 = 290 ms (all figures at the 50th percentile)
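The RTT arithmetic above is the kind of signal that network-based steering can use. The sketch below picks a PoP per ASN from measured RTTs and falls back to a country default. Only the 290 ms figure for ASN 15802 to Singapore comes from the slide; the second PoP name, the second ASN (64500, from the range reserved for documentation), the country code, and all other RTT values are hypothetical.

```python
# Median RTT in ms per (ASN, PoP). The 290 ms entry is the slide's 220 + 70;
# every other entry is a made-up placeholder for illustration.
RUM_RTT_MS = {
    (15802, "singapore"): 220 + 70,
    (15802, "pop-x"): 160,       # hypothetical
    (64500, "singapore"): 40,    # hypothetical ASN and value
    (64500, "pop-x"): 200,       # hypothetical
}

# GeoDNS-style default: country code -> PoP (hypothetical mapping).
GEO_DEFAULT = {"XX": "singapore"}


def pick_pop(asn, country, online_pops):
    """Prefer the online PoP with the lowest measured RTT for this ASN.

    Falls back to the country default when no RUM data exists for the ASN.
    """
    candidates = {
        pop: rtt
        for (measured_asn, pop), rtt in RUM_RTT_MS.items()
        if measured_asn == asn and pop in online_pops
    }
    if candidates:
        return min(candidates, key=candidates.get)
    default = GEO_DEFAULT.get(country)
    return default if default in online_pops else None
```

Two networks in the same country can end up on different PoPs here, which a country-level GeoDNS mapping cannot express.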
GeoDNS Suboptimal PoPs
APAC Challenges
[Map: London, Dublin, Mumbai, Singapore and Hong Kong, with ASN 15802 and ASN 5384; latencies shown: 35 ms, 45 ms, 70 ms, 160 ms, 160 ms, 350 ms]
GeoDNS Suboptimal PoPs
APAC Challenges
Performance & Adoption
IPv4 vs IPv6
• IPv6 performs better for our members
• Fewer request timeouts on IPv6 for mobile users
• Mobile carriers are adopting IPv6 faster
• Win for LinkedIn and our members!
• In July 2014 (IPv6 launch): 3% of traffic was IPv6
• Today: ~12% of traffic is IPv6
Key Takeaways
Conclusion
• Application-level traffic engineering is extremely important for content providers
• RUM data is extremely useful for finding anomalies
• Route traffic based on performance, not just location
• IPv6 performs better for LinkedIn users
Questions?