Best Practices for Scaling Websites: Lessons from eBay
Randy Shoup, eBay Distinguished Architect
QCon Tokyo, 2009.04.10

Challenges at Internet Scale
  • eBay users trade over $2000 in goods every second -- $60 billion per year
  • eBay site stores and actively uses over 2 PB of data
  • eBay processes 50 TB of new, incremental data per day
  • eBay Data Warehouse analyzes 50 PB per day
  • eBay manages …
      • 86.3 million active users worldwide
      • 120 million items for sale in 50,000 categories
      • Over 2 billion page views per day
  • In a dynamic environment
      • 300+ features per quarter
      • We roll 100,000+ lines of code every two weeks
  • In 39 countries, in 8 languages, 24x7x365
  • More than 48 billion SQL executions per day!

Architectural Forces at Internet Scale
  • Scalability
      • Resource usage should increase linearly (or better!) with load
      • Design for 10x growth in data, traffic, users, etc.
  • Availability
      • Resilience to failure
      • Rapid recoverability from failure
      • Graceful degradation
  • Latency
      • User experience latency
      • Data / execution latency
  • Manageability
      • Simplicity
      • Maintainability
      • Diagnostics
  • Cost
      • Development effort and complexity
      • Operational cost (TCO)

Best Practices for Scaling
  • Partition Everything
  • Asynchrony Everywhere
  • Automate Everything
  • Remember Everything Fails
  • Embrace Inconsistency

Best Practice 1: Partition Everything
  • Split every problem into manageable chunks
      • Split by data, load, or usage characteristics
      • “If you can’t split it, you can’t scale it”
  • Motivations
      • Scalability
          • Can scale different segments independently
          • Can scale by adding more partitions
      • Availability
          • Can isolate failures to a single segment or partition
      • Manageability
          • Can independently configure or upgrade a single partition
      • Cost
          • Can choose partition size to maximize price-performance

Best Practice 1: Partition Everything
  • Pattern: Functional Segmentation
      • Segment processing into pools, services, and stages
      • Segment data by entity and usage characteristics
  • Pattern: Horizontal Split
      • Load-balance processing
          • Within a pool, all servers are created equal
      • Split (or “shard”) data along primary access path
          • Partition by modulo of a key, range, lookup, etc.
  • Corollary: No Session State
      • User session flow moves through multiple application pools
      • Absolutely no session state in application tier
      • Session state maintained in cookie, URL, or database
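
The horizontal-split strategies named above (modulo of a key, range, lookup) can be sketched as three small routing functions. This is a minimal illustration, not eBay's implementation: the shard counts, range boundaries, directory contents, and function names are all assumptions made for the example.

```python
import bisect

# Hypothetical exclusive upper bounds for range-based shards (assumed values).
RANGE_UPPER_BOUNDS = [1_000_000, 5_000_000, 10_000_000]

# Hypothetical lookup directory mapping a key to its shard (assumed values).
SHARD_DIRECTORY = {"seller:alice": 2, "seller:bob": 0}


def shard_by_modulo(key: int, num_shards: int) -> int:
    """Route a numeric key to a shard by taking it modulo the shard count."""
    return key % num_shards


def shard_by_range(key: int, upper_bounds=RANGE_UPPER_BOUNDS) -> int:
    """Route a key to the first shard whose exclusive upper bound exceeds it."""
    index = bisect.bisect_right(upper_bounds, key)
    if index == len(upper_bounds):
        raise ValueError(f"key {key} is beyond the last configured range")
    return index


def shard_by_lookup(key: str, directory=SHARD_DIRECTORY) -> int:
    """Route a key through an explicit directory; unknown keys are an error."""
    return directory[key]
```

Modulo routing spreads keys evenly but moves many keys when the shard count changes, while range and lookup routing allow a single partition to be split or relocated at the cost of maintaining the boundaries or the directory.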

Best Practice 2: Asynchrony Everywhere
  • Prefer Asynchronous Processing
      • Move as much processing as possible to asynchronous flows
      • Where possible, integrate components asynchronously
  • Motivations
      • Scalability
          • Can independently scale components A and B
      • Availability
          • Allows component A or B to be temporarily unavailable
          • Can retry operations
      • Latency
          • Can reduce user experience latency at the cost of increasing data/execution latency
          • Can allocate more time to processing than user would tolerate
      • Cost
          • Can dampen peaks in load by queueing work for later
          • Can provision resources for the average load rather than the peak load

Best Practice 2: Asynchrony Everywhere
  • Pattern: Event Queue
      • Primary application writes data and produces event
          • Create event transactionally with primary insert/update (e.g., ITEM.NEW, ITEM.BID, ITEM.SOLD)
      • Consumers subscribe to event
          • At least once delivery, rather than exactly once
          • No guaranteed order, rather than in-order
          • Idempotency and readback
  • Pattern: Message Multicast
      • Search Feeder publishes item updates
          • Reads item updates from primary database
          • Publishes sequenced updates via multicast to search grid
      • Search engines listen to assigned subset of messages
          • Update in-memory index in real time
          • Request recovery when messages are missed
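
The Event Queue pattern above can be sketched with an in-memory SQLite database: the event row is written in the same transaction as the primary insert, and the consumer tolerates duplicate or reordered deliveries by reading the current state back from the primary table and overwriting its own copy. The table names, column names, and functions are hypothetical, and interpreting "readback" as re-reading the primary row is this sketch's assumption.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical primary, event, and index tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT, status TEXT)")
    db.execute(
        "CREATE TABLE event (seq INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, item_id INTEGER)"
    )
    db.execute(
        "CREATE TABLE search_index (item_id INTEGER PRIMARY KEY, title TEXT, status TEXT)"
    )
    return db


def create_item(db: sqlite3.Connection, item_id: int, title: str) -> None:
    """Write the item and its ITEM.NEW event in one transaction."""
    with db:  # commits both rows together, or rolls both back on error
        db.execute("INSERT INTO item VALUES (?, ?, 'ACTIVE')", (item_id, title))
        db.execute(
            "INSERT INTO event (type, item_id) VALUES ('ITEM.NEW', ?)", (item_id,)
        )


def handle_event(db: sqlite3.Connection, event_type: str, item_id: int) -> None:
    """Idempotent consumer: ignore the event payload and read back current state."""
    row = db.execute(
        "SELECT title, status FROM item WHERE id = ?", (item_id,)
    ).fetchone()
    if row is None:
        return  # nothing to index; the item is not (or no longer) present
    with db:
        # Overwriting by primary key makes repeated delivery harmless.
        db.execute(
            "INSERT OR REPLACE INTO search_index VALUES (?, ?, ?)", (item_id, *row)
        )
```

Because the consumer always copies the latest primary state, processing the same event twice, or processing two events for one item in the wrong order, leaves the index in the same final state.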

Best Practice 3: Automate Everything
  • Prefer Adaptive / Automated Systems to Manual Systems
  • Motivations
      • Scalability
          • Can scale with machines, not humans
      • Availability
          • Can adapt to changing environment more rapidly
      • Cost
          • Machines are far less expensive than humans
          • Can adjust themselves over time without manual effort

Best Practice 3: Automate Everything
  • Pattern: Adaptive Configuration
      • Do not manually configure event consumers
          • Number of threads, polling frequency, batch size, etc.
      • Define SLA for a given consumer
          • E.g., process 99% of events within 15 seconds
      • Each consumer dynamically adjusts to meet defined SLA
      • Consumers automatically adapt to changes in
          • Load
          • Event processing time
          • Number of consumer instances
  • Pattern: Machine Learning
      • Dynamically adapt search experience
          • On every request, determine best inventory and assemble optimal page for that user and context
      • Feedback loop enables system to learn and improve over time
          • Collect user behavior
          • Aggregate and analyze offline
          • Deploy updated metadata
          • Decide and serve appropriate experience
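
One way to picture Adaptive Configuration is a small control rule that a consumer runs periodically: measure the 99th-percentile event age, compare it with the SLA, and adjust its own thread count. The slides do not describe the actual adjustment algorithm, so the doubling and decrementing policy, the thresholds, and the function names below are assumptions chosen only to illustrate the feedback loop.

```python
def percentile(values, pct: int):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def next_thread_count(
    current: int,
    event_ages_s,
    sla_s: float = 15.0,
    pct: int = 99,
    max_threads: int = 64,
) -> int:
    """Pick the thread count for the next interval from observed event ages.

    Assumed policy: double the threads when the SLA percentile is missed,
    drop one thread when there is ample headroom, otherwise hold steady.
    """
    if not event_ages_s:
        return current  # no observations, no basis for a change
    observed = percentile(event_ages_s, pct)
    if observed > sla_s:
        return min(max_threads, current * 2)
    if observed < sla_s / 2:
        return max(1, current - 1)
    return current
```

For example, `next_thread_count(4, ages)` returns 8 when the 99th-percentile age in `ages` exceeds 15 seconds. The same rule could drive polling frequency or batch size instead of threads.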

Best Practice 4: Remember Everything Fails
  • Build all systems to be tolerant of failure
      • Assume every operation will fail and every resource will be unavailable
      • Detect failure as rapidly as possible
      • Recover from failure as rapidly as possible
      • Do as much as possible during failure
  • Motivation
      • Availability

Best Practice 4: Remember Everything Fails
  • Pattern: Failure Detection
      • Servers log all requests
          • Log all application activity, database and service calls on multicast message bus
          • Over 2 TB of log messages per day
      • Listeners automate failure detection and notification
  • Pattern: Rollback
      • Absolutely no changes to the site which cannot be undone (!)
      • Every feature has on / off state driven by central configuration
          • Feature can be immediately turned off for operational or business reasons
          • Features can be deployed “wired-off”
  • Pattern: Graceful Degradation
      • Application “marks down” a resource if it is unavailable or distressed
      • Application removes or ignores non-critical operations
      • Application retries critical operations or defers them to an asynchronous event
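
The Graceful Degradation pattern above can be sketched as a small "markdown" registry: after repeated failures a resource is marked down for a while, and callers of a non-critical operation receive a fallback instead of waiting on it. The class name, the failure threshold, and the retry interval are hypothetical choices for this sketch, not details from the talk.

```python
import time


class ResourceMarkdown:
    """Tracks which resources are currently marked down (assumed policy)."""

    def __init__(self, failure_threshold=3, retry_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._retry_after_s = retry_after_s
        self._clock = clock
        self._failures = {}
        self._down_until = {}

    def is_marked_down(self, name: str) -> bool:
        until = self._down_until.get(name)
        return until is not None and self._clock() < until

    def record_failure(self, name: str) -> None:
        count = self._failures.get(name, 0) + 1
        self._failures[name] = count
        if count >= self._threshold:
            # Mark the resource down and start counting afresh.
            self._down_until[name] = self._clock() + self._retry_after_s
            self._failures[name] = 0

    def record_success(self, name: str) -> None:
        self._failures[name] = 0
        self._down_until.pop(name, None)


def call_with_degradation(markdown, name, operation, fallback):
    """Run a non-critical operation, returning the fallback when it is unavailable."""
    if markdown.is_marked_down(name):
        return fallback  # skip the call entirely while the resource is marked down
    try:
        result = operation()
    except Exception:
        markdown.record_failure(name)
        return fallback
    markdown.record_success(name)
    return result
```

A page could, for instance, call `call_with_degradation(markdown, "recommendations", fetch, [])` and render without recommendations while that resource is distressed. A critical operation would instead be retried or deferred to an asynchronous event, as the slide notes.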

Best Practice 5: Embrace Inconsistency
  • Brewer’s CAP Theorem
      • Any shared-data system can have at most two of the following properties:
          • Consistency: All clients see the same data, even in the presence of updates
          • Availability: All clients will get a response, even in the presence of failures
          • Partition-tolerance: The system properties hold even when the network is partitioned
  • This trade-off is fundamental to all distributed systems

Best Practice 5: Embrace Inconsistency
  • Choose Appropriate Consistency Guarantees
      • Typically eBay trades off immediate consistency in order to guarantee availability and partition-tolerance
      • Most real-world systems do not require immediate consistency (even financial systems!)
      • Consistency is a spectrum
      • Prefer eventual consistency to immediate consistency
  • Avoid Distributed Transactions
      • eBay does absolutely no distributed transactions -- no two-phase commit
      • Minimize inconsistency through state machines and careful ordering of database operations
      • Reach eventual consistency through asynchronous event or reconciliation batch
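
The last two points above, state machines and a reconciliation batch, can be illustrated together. In this sketch a hypothetical item lifecycle only moves forward, so a replayed update is harmless, and a batch job repairs a secondary store by copying state from the primary. The state names, the transition table, and the dictionary-based stores are assumptions for the example, not eBay's schema.

```python
# Hypothetical item lifecycle: each state lists the states it may move to.
ALLOWED_TRANSITIONS = {
    "NEW": {"ACTIVE"},
    "ACTIVE": {"SOLD", "ENDED"},
    "SOLD": set(),
    "ENDED": set(),
}


def advance(current: str, target: str) -> str:
    """Apply a state change, accepting replays and rejecting illegal moves."""
    if target == current:
        return current  # replayed update: already applied, nothing to do
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


def reconcile(primary: dict, secondary: dict) -> list:
    """Batch repair: make the secondary store match the primary.

    Returns the sorted ids that had to be corrected.
    """
    repaired = []
    for item_id, state in primary.items():
        if secondary.get(item_id) != state:
            secondary[item_id] = state
            repaired.append(item_id)
    return sorted(repaired)
```

The design choice here is to treat the primary store as the source of truth: asynchronous events normally keep the secondary current, and the batch closes whatever gaps remain, so no two-phase commit is needed across the two stores.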

Concluding Messages
  • Message for the Audience: Best Practices for Scaling
      • Partition Everything
      • Asynchrony Everywhere
      • Automate Everything
      • Remember Everything Fails
      • Embrace Inconsistency
  • My Dream as an Engineer
      • To build very large distributed systems that are highly efficient, scalable, and resilient

Questions?

About the Presenter
Randy Shoup has been the primary architect for eBay's search infrastructure since 2004. Prior to eBay, Randy was Chief Architect and Technical Fellow at Tumbleweed Communications, and has also held a variety of software development and architecture roles at Oracle and Informatica.

Best Practices for Large-Scale Websites -- Lessons from eBay
