The Stack Exchange 
Infrastructure 
Vroom Vroom
inet.perf.profile 
• SRE Generalist @ Stack Exchange 
• @GABeech 
• http://brokenhaze.com
• http://stackexchange.com
A brief Overview 
• 560 Million Page Views a Month 
• 34TB of Data transferred a Month
• 1665 rps (2250 peak) Across web Farm 
• WISC(HER) 
Windows 
IIS 
SQL Server 
C# 
HAProxy 
Elastic Search 
Redis
Our First Priority is 
Performance 
Nobody likes a slow site, least of all us. 
When your site is slow, people leave.
Make your site fast, and the people will stay.
Good write-up on moz.com:
http://moz.com/blog/site-speed-are-you-fast-does-it-matter
Why do I bring up performance in an infra talk? Simple: it drives our design decisions.
The Performance 
toolkit 
• Mini Profiler 
• OpServer (https://github.com/opserver/Opserver)
• Client Timings (http://teststackoverflow.com/)
Mini Profiler 
Shown to every Dev/SRE on every page 
Oneboxed in our chat system
OpServer 
Bubbles up problems
OpServer HAProxy
OpServer Redis
OpServer SQL
Client Timings 
How well are we actually doing when _you_ load the page
You can’t be fast if you 
are not up 
• Highly Redundant network 
• Datacenter, ISP, Edge, Core, Server, Port 
The actual design starts now.
4 Different providers 
Selected for different characteristics 
Router Redundancy Hot/Standby HSRP/BGP on “T2” 
Full BGP tables and HSRP on T1
Load Balancers
• HAProxy 
• 2 Servers (Hot/Standby) 
• Multiple Tiers (HAProxy Processes) 
4B requests/month 
3000 req/sec peak 
10% CPU 18% peak 
Between 600 and 700 concurrent connections (ESTABLISHED, TIME_WAIT, etc.)
Multiple processes allow for granular restarts and segregation of faults
SSL Termination done on the LB 
Websockets: The weird connection 
Long lived 
TCP not HTTP
Request flow 
Diagram: a request comes in. Is it HTTP? Yes: send it to the servers. No: terminate HTTPS, after which it is HTTP.
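The decision in this flow can be sketched as a small function. This is only a reading of the diagram as transcribed, not the production setup: the real logic lives in the HAProxy configuration, and the names `Request`, `route`, and the step labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    scheme: str  # "http" or "https" (hypothetical field name)


def route(request: Request) -> List[str]:
    """Return the steps a request takes through the load balancer."""
    steps = []
    if request.scheme == "https":
        # Not plain HTTP: terminate SSL at the load balancer first.
        steps.append("terminate-ssl")
    elif request.scheme != "http":
        raise ValueError(f"unsupported scheme: {request.scheme}")
    # After termination, every request is handled as plain HTTP.
    steps.append("forward-http-to-web-servers")
    return steps
```

An HTTPS request therefore takes one extra step before it joins the same path as an HTTP request.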
SSL Termination 
• Terminated at LB 
• Feature added to HAProxy 1.5 
• See: http://brokenhaze.com/blog/2014/03/25/how-stack-exchange-gets-the-most-out-of-haproxy/
Source Port Exhaustion 
use 127.0.0.0/8 to resolve 
Server only running at ~12% cpu 
We don’t run full SSL everywhere yet
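Why extra loopback addresses help with source port exhaustion: a TCP connection is identified by source address, source port, destination address, and destination port, so connections between a single pair of addresses toward one destination port are capped by the number of usable source ports. The sketch below only illustrates that arithmetic. The port range is an assumption (a common Linux default for `net.ipv4.ip_local_port_range`), and `max_connections` is a hypothetical helper; the linked blog post describes the actual HAProxy setup.

```python
import ipaddress

# Assumption: a common Linux default ephemeral port range, 32768..60999.
EPHEMERAL_PORTS = range(32768, 61000)


def max_connections(addresses: int, ports: int = len(EPHEMERAL_PORTS)) -> int:
    """Upper bound on concurrent connections to ONE destination port
    when `addresses` distinct loopback addresses are in use."""
    return addresses * ports


# Every address in 127.0.0.0/8 is a loopback address, not just 127.0.0.1.
loopback = ipaddress.ip_network("127.0.0.0/8")

one_address = max_connections(1)
eight_addresses = max_connections(8)
```

Each additional address in 127.0.0.0/8 adds another full port range, so the ceiling scales linearly with the number of addresses used.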
Web Servers 
• IIS 
• 9 Production (2 Test/Dev) 
• Dell R610s
• 32GB Memory 
• 2xE5-5640 
185 req/s 250 peak 
15% CPU usage 20% peak
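A quick consistency check on these numbers, assuming the 185 and 250 req/s figures are per server (the slide does not say so explicitly, but the arithmetic matches the farm-wide 1665 rps and 2250 peak quoted in the overview). `rps_per_server` is a hypothetical helper.

```python
PRODUCTION_SERVERS = 9
AVG_RPS_PER_SERVER = 185
PEAK_RPS_PER_SERVER = 250

# Farm-wide totals implied by the per-server figures.
farm_avg_rps = PRODUCTION_SERVERS * AVG_RPS_PER_SERVER
farm_peak_rps = PRODUCTION_SERVERS * PEAK_RPS_PER_SERVER


def rps_per_server(total_rps: float, servers: int) -> float:
    """Load each server carries if traffic is spread evenly."""
    return total_rps / servers


# Peak load per server if one of the nine is out of rotation.
peak_with_one_down = rps_per_server(farm_peak_rps, PRODUCTION_SERVERS - 1)
```

With one server out, the remaining eight would each carry about 281 req/s at peak under an even spread.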
Data Tier 
• MS SQL Server 
• 4 Servers 
• 2 Always-On Clusters 
• Each Cluster 1 RW, 1 RO 
(SO) 343 M Queries per day 
(SO) Peak of 7500 queries / second 
(SE) 216M Queries per day 
(SE) Peak 3200 queries / second 
CPU Use: SO 8% Peak 15% — SE 10% Peak 20%
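To put the daily totals next to the per-second peaks, the daily figures can be converted to an average rate. This is a minimal sketch using only the numbers on this slide; the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(per_day: float) -> float:
    """Average rate implied by a daily total."""
    return per_day / SECONDS_PER_DAY


def peak_to_average(peak_per_second: float, per_day: float) -> float:
    """How many times higher the peak is than the daily average."""
    return peak_per_second / average_per_second(per_day)


so_avg = average_per_second(343_000_000)  # Stack Overflow cluster
se_avg = average_per_second(216_000_000)  # Stack Exchange cluster

so_ratio = peak_to_average(7500, 343_000_000)
se_ratio = peak_to_average(3200, 216_000_000)
```

By this arithmetic the SO cluster averages roughly 3,970 queries per second and the SE cluster 2,500, so the quoted peaks are under twice the average in both cases.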
Caching Tier 
• Redis 
• 2 Servers 
• Hot / Standby configuration 
3.65 B operations a day 
Peak 60,000/s 
3% cpu usage 
Tag Engine 
• Our Special index of SO 
• Tagging is hard 
• Written by Marc Gravell 
• http://blog.marcgravell.com/2014/04/technical-debt-case-study-tags.html
3 Servers, 32 GB RAM 
3644 req/s 
3% CPU 10% peak 
Replaced Full Text search in SQL Server 
Spins up a full copy of SO/SE 
Cool thing: can be upgraded with 0 downtime
Elastic Search 
• 203GB Index 
• 3 Machines 
• 42M searches/day 
2 others (not prod):
Machine learning
Logstash (300TB)
Deployment 
• Git 
• TeamCity 
• Custom Powershell Scripts 
TeamCity monitors our Development Git repository
Dev Auto builds (Deploy to Meta) 
When the build is verified Dev triggers Prod Build 
Copy Artifacts from Dev Build
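The promotion rule in this flow can be modelled in a few lines. This is a sketch under assumptions, not the actual pipeline, which is TeamCity plus custom PowerShell: the `Build` type, the `promote_to_prod` function, and the artifact name in the usage note are hypothetical, and it reads "Copy Artifacts from Dev Build" as meaning the prod build reuses the dev build's output.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Build:
    environment: str            # "dev" or "prod"
    artifacts: Tuple[str, ...]  # hypothetical artifact names
    verified: bool = False


def promote_to_prod(dev_build: Build) -> Build:
    """Create the prod build from a verified dev build."""
    if dev_build.environment != "dev":
        raise ValueError("only a dev build can be promoted")
    if not dev_build.verified:
        raise ValueError("dev build must be verified before prod is triggered")
    # Prod copies the dev artifacts instead of producing new ones.
    return Build("prod", dev_build.artifacts, verified=True)
```

For example, `promote_to_prod(Build("dev", ("site.zip",), verified=True))` returns a prod build carrying the same artifacts, while an unverified dev build is rejected.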
So what does this get 
you 
• 52 ms homepage render time 
• 33 ms questions page render time
Always See our 
Performance 
• http://stackexchange.com/performance
Thank YOU! 
Contact: 
@GABeech 
george@stackoverflow.com 
Office Hours: 
Wednesday, November 12th 
(today…) 
2:00pm - 3:30pm 
LISA Lab

Stack Exchange Infrastructure - LISA 14