Designing for the Cloud: A Tutorial
Stuart Charlton, CTO, Elastra
Tutorial Objectives
  • What has cloud computing done to IT systems design & architecture?
      • “The future is already here, it’s just not very evenly distributed” (Gibson)
  • How should new systems be designed with the new constraints?
      • Such as: parallelism, availability, on-demand infrastructure
  • Where can I find practical frameworks, tools, and techniques, and what are the tradeoffs?
      • Hadoop, Cassandra, Parallel DBs, Actors, Caches, Containers, and Configuration Management
About Your Presenter
  • Stuart Charlton
      • Canadian, now in San Francisco
  • CTO, Elastra
      • Focus on Customers, Products, Technology Directions
  • In prior lives... BEA Systems, Rogers Communications, Financial Services, global training & consulting
  • RESTafarian and Data geek
  • Stu Says Stuff: http://stucharlton.com/blog
Tutorial Agenda, in 4 Words
  • Clouds
  • Service
  • Data
  • Control
Agenda – Part 1
  • Clouds: Fear of a Fluffy Planet
      • What has changed, and what remains the same?
      • Designing applications in this world
      • A Cloud Design Reference Architecture (aka a cheat sheet to categorize thinking in the clouds)
  • Service: Foundations for Systems
      • Solving Big Problems vs. Little Problems
      • Amdahl’s Law & The Universal Scalability Law
      • Actor-Based Concurrency: Dr. Strangelanguage (or How I Learned to Stop Worrying and Love Erlang)
Agenda – Part 2
  • Data: Management & Access
      • Contrasting Philosophies
          • Persistence vs. Management; Scale-Up vs. Scale-Out
          • Shared Disk vs. Shared Nothing
      • A survey of solutions (from clustered DBMS to K/V stores)
      • Consistency, Availability, Partitioning (CAP) Tradeoffs
          • Deep dig into what these really imply
  • Control: Containers, Configuration & Modeling
      • The Dev/Ops Tennis Match
      • The Evolution of Automation
          • From Scripts to Runbooks to FSMs to HTNs
Caveats
  • Audience Assumption: IT Devs & Architects
      • Some exposure to cloud, but not necessarily advanced
  • The technology is a fast-moving target
      • Especially the state of the specific tools & frameworks
  • Theory vs. practice
      • I try to balance the two; both are essential
  • Time is limited
      • Only scratching the surface of certain topics
      • Missing topics are usually full tutorials in their own right
  • Much of the subject matter is up for debate
      • And, this is a tutorial, not a workshop…
Clouds
Fear of a Fluffy Planet
(Image courtesy of browsertoolkit.com)
The Freedom!
  • On Demand Infrastructure via API calls
      • Inside or outside my data centres (Private / Public Cloud)
  • Pay-per-use pricing models
      • Great for temporary growth needs
  • Platform-as-a-Service
      • Scalability without Skill, Availability without Avarice
  • Large Scale, Always On
      • New opportunities due to cheaper scale & availability
The Horror!
  • Hype Overdrive
      • Cloud Running Shoes! Cloud Chewing Gum! GOOG!
      • Werner Vogels Action Figures! (well, not quite yet)
  • Standards Support
      • So many to choose from!
      • OCCI, vCloud + OVF, EC2, WBEM, WS-Management
  • Platform-as-a-Service
      • What color would you like for your locked trunk’s interior?
  • Crazy Talk
      • No SQL! Eventual Consistency! Infrastructure as Code!
Will the Real Slim Cloudy Please Stand Up?
  • “I, for one, welcome our new outsourced overlords”
      • Finer-grained outsourcing
      • Metered resource usage
      • APIs & self-service UIs
      • … but isn’t outsourcing often a shell game?
      • See Distributed Computing Economics, Jim Gray (2003)
  • “Scale without skill, availability without avarice”
      • Insert constrained code [here]
      • Magically scalable & available
      • GAE, Azure (some day)
      • … but aren’t you locked in?
Will the Real Slim Cloudy Please Stand Up? (continued)
  • “I like Big *aaS and I cannot lie”
      • Private, Public, or Community Clouds
      • Multiple stack levels
      • “Real” SOA, not just web services
      • … haven’t I heard this before?
  • “My name is… what? Slim Cloudy!”
      • Reduced lead times to change
      • Agile Operations / Lean IT
      • Revolution in systems management
      • … can we really change IT?
Designing Applications in this World
  • Distributed & networked systems have triumphed
      • The fallacies must be taken seriously now
      • Network is unreliable, latency > 0, bandwidth is finite, topology might change, etc.
  • Scale-out & fault tolerance: the new design center
      • Versus productive business logic, data management, etc.
  • What’s old is new
      • Some challengers to mainstream ideas are old ideas being reapplied
      • e.g. Erlang, Map/Reduce, distributed file systems, replication
Designing Applications in this World (continued)
  • Autonomous services constitute most systems
      • Full-stack services, not just bits of code
  • Design for constant operations
      • Interdependence + Distribution + Autonomy = Pain
      • FCAPS (Fault, Configuration, Accounting, Performance & Security Management)
  • Security & Privacy
      • Multi-tenancy, data-in-transit vs. data-at-rest, etc.
Solving for one’s own problems
  • Mainstream tools, platforms, and servers have not consistently caught up
      • LOTS of software experimentation in: web servers, containers, caches, databases, network configuration, systems management
  • The danger is to view new solutions as the better way of doing things in general
      • It’s possible; but stuff is changing quickly
      • New territory always involves a level of reinvention
  • The tech world has not rebooted due to cloud computing
      • Beware Fanbois/Fangrrls, Pundits & The Press
A Cloud Design Reference Architecture
  • Web: WebArch & REST
  • Service, Data, & Control: this tutorial
  • Resource: virtualization, management & infrastructure clouds
  • Layers (diagram): WEB, SERVICE, DATA, CONTROL, RESOURCE
Service
Organizing your computing domain for
  • fault
  • scale
  • management
Data
Storage, retrieval, integrity, recovery given
  • Distributed systems
  • Large scale
  • High availability
  • (possible) Multi-tenancy
Control
Provisioning, configuration, governance, and optimization of infrastructure
  • Resource brokerage
  • Policy constraints
  • Dependency management
  • Software configuration
  • Authorization & Auditability
Service
Foundation for Systems
Designing a Service, circa 1998-2008
  • Multi-Tier Hybrid Architecture
      • Some stateless, some stateful computing
      • Session state is replicated
  • Independent servers / applications
      • Low-level redundancy (RAID, 2x NICs, etc.)
      • “Put your eggs into a small number of baskets, and watch those baskets”
  • General assumptions
      • Failure at the service layer shouldn’t lead to downtime
      • Failure at the data layer may be catastrophic
Designing a Service, circa 2008+
  • Autonomous services
      • Divide system into areas of functional responsibility (tiers irrelevant)
  • Interdependent servers / applications
      • Software-level redundancy and fault handling
      • “Many, many servers breaking big problems down or distributing lots of little problems around”
  • New realities
      • Partial failure is a regular, normal occurrence; no excuse for downtime from any service
Breaking or bridging a problem across resources
  • Big Problems (Parallel)
      • Theory: Amdahl’s law; shared memory or disk vs. shared nothing
      • New Practice: MapReduce (e.g. Hadoop), Spaces, Master/Worker
      • Retro: Linda, MPI, OpenMP, IPC or Threads
  • Little Problems (Concurrent)
      • Theory: Actor model & process calculi
      • New Practice: Lightweight Messaging, Spaces, Erlang & Scala Actors
      • Retro: IPC, Thread pools, Components (COM+/EJB), Big Messaging (MQ, TIB, JMS)
Case Study in “Big Problem” Solving: MapReduce & Apache Hadoop
  • Input
      • Read your data from files as a K/V map
  • Distribute Mapping Function
      • Input: one (k, v) pair
      • Returns: a new K/V list
  • Partition & Sort
      • Handled by the framework (e.g. Hadoop)
      • Provide a comparator
  • Distribute Reduce Function
      • Input: one (k, list of values) pair
      • Returns: a list of output values
  • Output
      • Save the list as a file
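The five MapReduce stages can be sketched in a single process. This is a minimal illustration in plain Python, not Hadoop's API: `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names, the "files" are in-memory strings, and nothing is actually distributed.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical mapper: one (key, value) pair in, a new K/V list out.
def map_words(filename: str, text: str) -> list[tuple[str, int]]:
    return [(word.lower(), 1) for word in text.split()]

# Hypothetical reducer: one (key, list of values) pair in, a list of outputs out.
def reduce_counts(word: str, counts: list[int]) -> list[int]:
    return [sum(counts)]

def run_mapreduce(
    records: Iterable[tuple[str, str]],
    mapper: Callable[[str, str], list[tuple[str, int]]],
    reducer: Callable[[str, list[int]], list[int]],
) -> dict[str, list[int]]:
    # Map phase: apply the mapper to every input pair.
    intermediate: list[tuple[str, int]] = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Partition & sort phase: group values by key, then visit keys in order.
    # A real framework does this across machines; here it is one dictionary.
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

# Input: stand-ins for files read as a K/V map (filename -> contents).
records = [
    ("a.txt", "the cloud is the future"),
    ("b.txt", "the future is distributed"),
]
result = run_mapreduce(records, map_words, reduce_counts)
```

The mapper and reducer never share state, which is what lets a framework run many copies of each on different machines.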
… But how fast can I get?
Theory Interlude: Amdahl’s Law
  • How fast can I speed up a sequential process?
  • Time = Serial part + Parallel part
  • Thus, the speedup is

    $$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$

      • where P is the fraction of the program that can be parallelized
      • and N is the number of processors
  • What happens when P is 95%? Maximum of 20x
  • How about 99.99%?
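The questions on this slide can be checked numerically. The sketch below assumes the standard form of Amdahl's law, speedup = 1 / ((1 − P) + P / N); the function names are illustrative, not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup on `processors` CPUs when `parallel_fraction` of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the processor count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)

if __name__ == "__main__":
    for p in (0.95, 0.9999):
        print(f"P={p}: limit ~ {amdahl_limit(p):.0f}x, "
              f"at 1024 processors ~ {amdahl_speedup(p, 1024):.1f}x")
```

With P = 95% the bound is about 20x however many processors are added. With P = 99.99% the bound rises to about 10,000x, so the serial fraction sets the ceiling.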
Gunther’s Universal Scalability Law
  • It gets worse…
  • Most scale-out experiences retrograde behavior at peak loads

    $$C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}$$

      • α is the contention
      • β is the coherency delay
  • http://www.perfdynamics.com/Manifesto/gcaprules.html
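To see the retrograde behavior, the capacity formula can be evaluated directly. In this sketch the α and β values are made up for illustration, and `usl_peak` uses the maximum obtained by setting the derivative of C(N) to zero, which gives N* = sqrt((1 − α) / β).

```python
import math

def usl_capacity(n: float, alpha: float, beta: float) -> float:
    """Relative capacity at n nodes under the Universal Scalability Law."""
    return n / (1.0 + alpha * (n - 1.0) + beta * n * (n - 1.0))

def usl_peak(alpha: float, beta: float) -> float:
    """Node count at which capacity peaks; beyond it, adding nodes hurts."""
    return math.sqrt((1.0 - alpha) / beta)

if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # illustrative contention and coherency values
    print(f"capacity peaks near N = {usl_peak(alpha, beta):.1f}")
    for n in (1, 10, 31, 100):
        print(f"N={n:>3}: capacity ~ {usl_capacity(n, alpha, beta):.2f}")
```

With β = 0 the expression reduces to Amdahl-style saturation. Any nonzero coherency delay makes capacity fall once N passes the peak.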
Case study in solving “little problems”
Actors: The Basic Idea
  • Programmable entities are concurrent, share nothing, and communicate through messages
  • Actors can
      • Send messages
      • Create other actors
      • Specify how they respond to messages
  • Very lightweight (actors = objects)
  • Usually no ordering guarantees
  • At the language level
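The actor idea can be approximated in Python with one mailbox and one thread per actor. This is a rough sketch under stated assumptions: `Actor` and `Counter` are hypothetical classes, and OS threads are far heavier than Erlang processes or Scala actors, so it shows the messaging discipline rather than the lightweight runtime.

```python
import queue
import threading

_STOP = object()  # sentinel message that tells an actor to shut down

class Actor:
    """Owns private state; the only way in is a message to its mailbox."""

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message) -> None:
        self._mailbox.put(message)

    def stop(self) -> None:
        self.send(_STOP)
        self._thread.join()

    def _run(self) -> None:
        # Messages are handled one at a time, so state needs no locks.
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self.receive(message)

    def receive(self, message) -> None:
        raise NotImplementedError

class Counter(Actor):
    def __init__(self) -> None:
        self.total = 0  # private state, touched only by the actor's thread
        super().__init__()

    def receive(self, message) -> None:
        kind, payload = message
        if kind == "add":
            self.total += payload
        elif kind == "get":
            payload.put(self.total)  # payload is the sender's reply queue

counter = Counter()
for n in (1, 2, 3):
    counter.send(("add", n))
reply: queue.Queue = queue.Queue()
counter.send(("get", reply))
total = reply.get(timeout=5)
counter.stop()
```

A single sender's messages arrive in order here because they share one queue. Between independent senders there is no such guarantee, which matches the "usually no ordering guarantees" caveat.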
Erlang
Supervisors: Assuming failure will occur
  • Failures require cleanup & restart
  • Supervisor relationships can ensure the system tolerates faults
  • Hot-swap patches
  • Fundamentally in the language libraries
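The supervisor idea, letting a worker fail and restarting it instead of defending against every fault inside it, can be sketched in Python. This is an analogy only and not Erlang/OTP's supervisor API: `supervise` and `FlakyWorker` are hypothetical names, and the restart limit is an assumed policy.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def supervise(task: Callable[[], T], max_restarts: int = 3) -> tuple[T, int]:
    """Run `task`, restarting it after a crash, up to `max_restarts` times.

    Returns the task's result and the number of restarts that were needed.
    If the restart budget is exhausted, the last failure propagates upward,
    where a higher-level supervisor could decide what to do.
    """
    restarts = 0
    while True:
        try:
            return task(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1

class FlakyWorker:
    """Stand-in for a worker hit by transient faults (illustrative only)."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.attempts = 0

    def __call__(self) -> str:
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("transient fault")
        return "ok"

worker = FlakyWorker(failures_before_success=2)
result, restarts = supervise(worker)
```

The worker contains no recovery code of its own. The recovery policy lives in the supervisor, which keeps worker logic simple.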
What kinds of failures? A Simplification.
  • Exceptional Conditions
      • Conditions that a programmer did not or should not handle
      • Tolerated through replication, fast failure, and/or restart(s)
      • Examples: hardware failures, network outages, “Heisenbugs”, rare software conditions
  • Error Conditions
      • Conditions that the programmer can handle
      • Handled through cleanup or “catch” code
      • Examples: file not found, type conversion, bad arithmetic (divide by zero), malformed input
Data
Management & Access
Evolving the Database: Two Philosophies
  • Data Persistence Systems and Frameworks
      • Goal: Store & retrieve data quickly and reliably, with minimal hassle to the programmer
      • Often uses application tools & languages to manage & access data
      • Focused set of features
  • Database Management Systems (DBMS)
      • Goal: Manage the access, integrity, security, and reliability of data, independently of applications
      • Hard separation of tools & languages (e.g. SQL, DBA tools)
      • Broad set of features
Scaling the Database: Two Philosophies
  • Scale-Up
      • Concurrent processing & parallelism through hardware
      • SMP, NUMA, MPP
      • RAID Arrays (SAN & NAS)
      • Shared disk or memory
      • Benefit: It worked in the 90s.
      • Drawback: Expensive, often bespoke, forklift upgrades
  • Scale-Out
      • Concurrent processing & parallelism through software
      • Commodity hardware
      • Software provides the engine
      • Shared nothing
      • Benefit: Linear scale, easy to standardize, easy to replicate / upgrade
      • Drawback: Traditionally, the software sucked.
… What happens when database clustering software stops sucking? (i.e. now)
  • A flurry of programmer-oriented approaches
      • Persistence engines rule the bleeding edge in 2009
      • Key/Value Stores, JSON Document stores, etc.
  • Declarative/Imperative impedance mismatch (the “Vietnam” of the software tools industry) gets conflated with distributed data
  • Lots of practical confusion
      • What are the tradeoffs with a widely scaled-out database system?
Too many choices, with idiosyncratic design histories
Let’s detangle this…
When should I share components?
  • Shared Disk
      • Partition compute across nodes
      • Storage is shared through NAS or SAN
      • Good for:
          • Mixed workload
          • Small random access reads
      • Worst case:
          • Inter-node network chatter caps scalability
          • Disk pings to propagate writes (e.g. Oracle pre-RAC)
  • Shared Nothing
      • Partition data across nodes
      • Each node owns its data
      • Good for:
          • Read-mostly
          • Parallel reads of huge data volumes
          • Consistent writes go to one partition
      • Worst case:
          • Repartitioning
          • Hotspot records don’t scale
          • Writes that span partitions
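The shared-nothing worst cases follow from how keys map to nodes. The sketch below assumes simple hash-modulo placement, which is one common scheme and not necessarily what any product listed in this tutorial uses; `partition_for`, `partitions_touched`, and `fraction_moved` are illustrative names.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to the single node that owns it (hash-modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def partitions_touched(keys: list[str], num_partitions: int) -> set[int]:
    """Nodes a write must coordinate across if it updates all of `keys`."""
    return {partition_for(key, num_partitions) for key in keys}

def fraction_moved(keys: list[str], old_count: int, new_count: int) -> float:
    """Share of keys whose owner changes when the node count changes."""
    moved = sum(
        1 for key in keys
        if partition_for(key, old_count) != partition_for(key, new_count)
    )
    return moved / len(keys)

keys = [f"customer:{i}" for i in range(1000)]

# A write to one record always lands on exactly one partition.
single = partitions_touched(["customer:42"], 4)

# A write spanning many records may need several nodes to agree.
spanning = partitions_touched(keys[:50], 4)

# Growing from 4 to 5 nodes relocates most keys under this naive scheme.
moved = fraction_moved(keys, 4, 5)
```

Under hash-modulo placement only the keys whose hash gives the same remainder for both node counts stay put, which is why repartitioning is listed as a worst case. Schemes such as consistent hashing exist to reduce that movement.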
Modern Data Persistence Systems
  • Object Persistence
      • “Navigational databases in Java, Smalltalk, C++”
      • GemStone, Versant, Objectivity
  • Distributed Key-Value Stores
      • “Structured data with lesser need for complex queries”
      • Consistent: BigTable, HBase, Voldemort
      • Eventually Consistent: Dynamo, Cassandra
  • Document and/or Blob Stores
      • “Indexed structured data + binaries/fulltext”
      • CouchDB, BerkeleyDB, MongoDB
Clustered DBMS for Transactions
  • Oracle Real Application Clusters (RAC)
      • Shared disk, Replicated Memory (“Cache Fusion”)
      • Limited by mesh interconnect to disk (partitioning possible)
  • IBM DB2 Data Partitioning Feature
      • Shared nothing database cluster, high number of nodes
  • IBM DB2 pureScale
      • New (Oct 2009) technology that ports IBM DB2 mainframe shared-disk clustering to DB2 for open systems
  • Microsoft SQL Server 2008
      • “Federated” Shared Nothing Database, a longtime feature
Clustered DBMS for Parallel Queries
  • Teradata
      • The old standard data warehouse, hardware + software
  • Netezza
      • Data warehousing appliance (hardware + software)
  • Vertica
      • Column-oriented, shared nothing clustered database
      • Mike Stonebraker’s new company
  • Greenplum
      • Column-oriented, shared nothing clustered database
      • Based on PostgreSQL with MapReduce engine
Scaling to Internet-Scale
  • Single Control Domain
      • One Database Site
      • Consistency is built-in
      • Scalable with tradeoffs among different workloads
      • Scale to the limits of network bandwidth & manageability
      • Main Example: Clustered DBMS
  • Multiple Control Domains
      • Many Database Sites
      • Consistency requires agreement protocol
      • Scalable only if consistency is relaxed
      • Nearly limitless (global) scale
      • Main Examples: DNS, The Web
How do I make consistency tradeoffs?
Theory interlude: The CAP theorem
  • Consistency (A+C in ACID)
      • There’s a total ordering on all operations on the data; i.e. like a sequence
  • Availability
      • Every request on non-failed servers must have a response
  • Tolerance to Network Partitions
      • All messages might be lost between server nodes
  • Choose at most two of these (as a spectrum).
CAP Tradeoffs: Consistency & Availability
  • The common case.
  • Fault tolerance through replicas & fast fail + fast recovery
  • Implication:
      • network outage between servers might halt the system
      • generally requires a single domain of control
  • Examples that emphasize C+A:
      • Single-site cluster databases
      • Google BigTable
          • Hadoop’s HBase
      • Oracle RAC, IBM DB2 Parallel
      • Clustered file systems
          • Google File System & HDFS
      • Distributed Spaces & Caches
          • Coherence, Gigaspaces & Terracotta

CAP Tradeoffs: Consistency & Partitions
  • Common approach for traditional distributed systems
  • Implication:

Designing for the Cloud Tutorial - QCon SF 2009

  • 1. Designing for the Cloud:A TutorialStuart Charlton, CTO, Elastra
  • 2. Tutorial ObjectivesWhat has cloud computing done to IT systems design & architecture?“The future is already here, it’s just not very evenly distributed” (Gibson)How should new systems be designed with the new constraints?Such as: parallelism, availability, on demand infraWhere can I find are practical frameworks, tools, and techniques, and what are the tradeoffs?Hadoop, Cassandra, Parallel DBs, Actors, Caches, Containers, and Configuration Management
  • 3. About Your PresenterStuart CharltonCanadian, now in San FranciscoCTO, ElastraFocus on Customers, Products, Technology DirectionsIn prior lives... BEA Systems, Rogers Communications, Financial Services,global training & consultingRESTafarian and Data geekStu Says Stuffhttp://stucharlton.com/blog
  • 4. Tutorial Agenda, in 4 WordsCloudsServiceDataControl4
  • 5. Agenda – Part 1Clouds: Fear of a Fluffy PlanetWhat has changed, and what remains the same?Designing applications in this worldA Cloud Design Reference Architecture(aka. A cheat sheet to categorize thinking in the clouds)Service: Foundations for SystemsSolving Big Problems vs. Little ProblemsAmdahl’s Law & The Universal Scalability Law Actor-Based Concurrency: Dr. Strangelanguage, (or How I Learned to Stop Worrying and Love Erlang)
  • 6. Agenda – Part 2Data: Management & AccessContrasting PhilosophiesPersistence vs. Management; Scale-Up vs. Scale-OutShared Disk vs. Shared NothingA survey of solutions (from clustered DBMS to K/V stores)Consistency, Availability, Partitioning (CAP) TradeoffsDeep dig into what these really implyControl: Containers, Configuration & ModelingThe Dev/Ops Tennis MatchThe Evolution of AutomationFrom Scripts to Runbooks to FSMs to HTNs
  • 7. CaveatsAudience Assumption: IT Devs & ArchitectsSome exposure to cloud, but not necessarily advancedThe technology is a fast moving targetEspecially state of the specific tools & frameworksTheory vs. practiceI try to balance the two; both are essentialTime is limitedOnly scratching the surface of certain topicsMissing topics are usually full tutorials in their own rightMuch of the subject matter is up for debateAnd, this is a tutorial, not a workshop…. 
  • 8. CloudsFear of a Fluffy Planet8
  • 10. The Freedom!On Demand Infrastructure via API callsInside or outside my data centres (Private / Public Cloud)Pay-per-use pricing modelsGreat for temporary growth needsPlatform-as-a-ServiceScalability without Skill, Availability without AvariceLarge Scale, Always OnNew opportunities due to cheaper scale & availability
  • 11. The Horror!Hype OverdriveCloud Running Shoes! Cloud Chewing Gum! GOOG!Werner Vogels Action Figures! (well, not quite yet)Standards SupportSo many to choose from!OCCI, vCloud + OVF, EC2, WBEM, WS-ManagementPlatform-as-a-ServiceWhat color would you like for your locked trunk’s interior?Crazy TalkNo SQL! Eventual Consistency! Infrastructure as Code!
  • 12. Will the Real Slim Cloudy Please Stand Up?“I, for one, welcome our new outsourced overlords”Finer-grained outsourcingMetered resource usageAPIs & self-service UIs… but isn’t outsourcing often a shell game?See Distributed Computing Economics, Jim Gray (2003)“Scale without skill, availability without avarice”Insert constrained code [here]Magically scalable & availableGAE, Azure (some day)… but aren’t you locked in?
  • 13. Will the Real Slim Cloudy Please Stand Up?“I like Big *aaS and I cannot lie”“My name is… what? Slim Cloudy!”Private, Public, or Community CloudsMultiple stack levels“Real” SOA, not just web services… haven’t I heard this before?Reduced lead times to changeAgile Operations / Lean ITRevolution in systems management… can we really change IT?
  • 14. Designing Applications in this WorldDistributed & networked systems have triumphedThe fallacies must be taken seriously nowNetwork is unreliable, latency > 0, bandwidth is finite, topology might change, etc.Scale-out & fault tolerance: the new design centerVersus productive business logic, data management, etc.What’s old is newSome challengers to mainstream ideas are old ideas being reappliede.g. Erlang, Map/Reduce, distributed file systems, replication
  • 15. Designing Applications in this WorldAutonomous services constitute most systemsFull-stack services, not just bits of codeDesign for constant operationsInterdependence + Distribution + Autonomy = PainFCAPS (Fault, Configuration, Accounting, Performance & Security Management) Security & PrivacyMulti-tenancy, data-in-transit vs. data-at-rest, etc.
  • 16. Solving for one’s own problemsMainstream tools, platforms, and servers have not consistently caught upLOTS of software experimentation in:Web servers, containers, caches, databases, network configuration, systems managementThe danger is to view new solutions as the better way of doing things in generalIt’s possible; but stuff is changing quicklyNew territory always involves a level of reinventionThe tech world has not rebooted due to cloud computingBeware Fanbois/Fangrrls, Pundits & The Press
  • 17. A Cloud Design Reference ArchitectureWeb – WebArch & RESTService, Data,& Control – this tutorialResource –virtualization,management &infrastructure cloudsWEBSERVICEDATACONTROLRESOURCE
  • 18. ServiceOrganizing your computing domain forfaultscalemanagementWEBSERVICEDATACONTROLRESOURCE
  • 19. DataStorage, retrieval,integrity, recovery givenDistributed systemsLarge scaleHigh availability(possible) Multi-tenancyWEBSERVICEDATACONTROLRESOURCE
  • 20. ControlProvision, configuration, governance, and optimization of infrastructureResource brokeragePolicy constraintsDependency managementSoftware configurationAuthorization & AuditabilityWEBSERVICEDATACONTROLRESOURCE
  • 22. Designing a Service, circa 1998-2008Multi-Tier Hybrid ArchitectureSome stateless, some stateful computingSession state is replicatedIndependent servers / applicationsLow-level redundancy (RAID, 2x NICs, etc.)“Put your eggs into a small number of baskets, and watch those baskets”General assumptionsFailure at the service layer shouldn’t lead to downtimeFailure at the data layer may be catastrophic
  • 23. Designing a Service, circa 2008+Autonomous services Divide system into areas of functional responsibility (tiers irrelevant)Interdependent servers / applicationsSoftware-level redundancy andfault handling “Many, many servers breaking big problems down or distributinglots of little problems around”New realitiesPartial failure is a regular, normal occurrence; no excuse for downtime from any service
  • 24. Breaking or bridging a problem across resourcesBig Problems (Parallel)Theory:Amdahl’s lawShared memory or disk vs. Shared nothingNew Practice:MapReduce (e.g. Hadoop), Spaces, Master/WorkerRetro: Linda, MPI, OpenMP, IPC or ThreadsLittle Problems (Concurrent)Theory: Actor-model & process calculiNew Practice: Lightweight Messaging, Spaces, Erlang & Scala ActorsRetro: IPC, Thread pools,Components (COM+/EJB),Big Messaging (MQ, TIB, JMS)
  • 25. Case Study in “Big Problem” Solving:MapReduce & Apache HadoopInputRead your data from files as a K/V mapDistribute Mapping FunctionInput one (k,v) pairreturns new K/V listPartition & SortHandled by framework (eg. Hadoop)Provide a comparatorDistribute Reduce FunctionInput one (k, list of values) pairReturn a list of output valuesOutputSave the list as a file
  • 26. ….But how fast can I get?Theory Interlude: Amdahl’s LawHow fast can I speed up a sequential process?Time = Serial part + Parallel part Thus, the speed up isWhere P is the % of the program that can be parallelN is the number of processorsWhat happens when P is 95%? -- Maximum of 20x How about 99.99%?
  • 27. Gunther’s Universal Scalability LawIt gets worse…Most scale-outexperiencesretrogradebehavior at peak loadsCapacity(N)  =   N 1 + α (N − 1) + β N (N − 1) α is the contention β is the coherency delayhttp://www.perfdynamics.com/Manifesto/gcaprules.html
  • 28. Case study in solving “little problems”Actors: The Basic IdeaProgrammable entities are concurrent, share nothing, communicate through messagesActors canSend messagesCreate other actorsSpecify how it responds to messagesVery lightweight (actors = objects)Usually no ordering guaranteesAt the language level
  • 29. ErlangSupervisors: Assuming failure will occurFailures require cleanup & restartSupervisor relationships canensure the systemtolerates faultsHot-swap patchesFundamentally inthe language libraries
  • 30. What kinds of failures? A Simplification.Exceptional ConditionsConditions that a programmer did not or should not handleTolerated through replication, fast failure, and/or restart(s)ExamplesHardware failures, network outages, “Heisenbugs”, rare software conditionsConditions that the programmer can handleHandled through cleanup or “catch” codeExamplesFile not found, type conversion, bad arithmetic (divide by zero),malformed inputError Conditions
  • 32. Evolving the Database: Two PhilosophiesData Persistence Systemsand FrameworksDatabase Management Systems(DBMS)Goal: Store & retrieve data quickly, reliable, with minimal hassle to the programmerOften uses application tools & languages to manage & access dataFocused set of featuresGoal: Manage the access, integrity, security, and reliability of data, independently of applicationsHard separation of tools & languages (e.g. SQL, DBA tools)Broad set of features
  • 33. Scaling the Database: Two PhilosophiesScale-UpScale-OutConcurrent processing & parallelism through hardwareSMP, NUMA, MPPRAID Arrays (SAN & NAS)Shared disk or memoryBenefit: It worked in the 90s.Drawback: Expensive, often bespoke, forklift upgradesConcurrent processing & parallelism through softwareCommodity hardwareSoftware provides the engineShared nothingBenefit: Linear scale, easy to standardize, easy to replicate / upgradeDrawback: Traditionally, the software sucked.33
  • 34. … What happens when database clustering software stops sucking? (i.e. now)A flurry of programmer-oriented approachesPersistence engines rule the bleeding edge in 2009Key/Value Stores, JSON Document stores, etc.Declarative/Imperative impedance mismatch(the “Vietnam” of the software tools industry) gets conflated with distributed dataLots of practical confusionWhat are the tradeoffs with a widely scaled out database system?
  • 35. Too many choices, with idiosyncratic design histories
  • 37. When should I share components? Shared Disk: Partition compute across nodes. Storage is shared through NAS or SAN. Good for: mixed workload; small random access reads. Worst case: inter-node network chatter caps scalability; disk pings to propagate writes (e.g. Oracle pre-RAC). Shared Nothing: Partition data across nodes. Each node owns its data. Good for: read-mostly; parallel reads of huge data volumes; consistent writes go to one partition. Worst case: repartitioning; hotspot records don’t scale; writes that span partitions.
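The shared-nothing worst cases can be made concrete with a small sketch. It assumes simple hash-modulo partitioning, which is only one possible scheme and not one the slides prescribe; the function names and the `customer:` key format are hypothetical.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to the partition (node) that owns it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partitions_touched(keys, num_partitions: int) -> set:
    """Partitions a single write must coordinate if it updates all `keys`."""
    return {partition_for(key, num_partitions) for key in keys}


def keys_moved(keys, old_count: int, new_count: int) -> list:
    """Keys whose owning partition changes when the cluster is resized."""
    return [
        key for key in keys
        if partition_for(key, old_count) != partition_for(key, new_count)
    ]


customers = [f"customer:{i}" for i in range(1000)]
single = partitions_touched(["customer:42"], 4)   # one owner: a cheap write
spanning = partitions_touched(customers[:10], 4)  # may need several owners
moved = keys_moved(customers, 4, 5)               # repartitioning cost
print(len(single), len(spanning), len(moved))
```

A write to one key always lands on exactly one partition, whereas a write covering many keys may have to coordinate several nodes. With plain modulo hashing, growing from 4 to 5 nodes relocates most keys, which is why repartitioning appears in the worst-case list.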
  • 38. Modern Data Persistence Systems. Object Persistence: “Navigational databases in Java, Smalltalk, C++”. GemStone, Versant, Objectivity. Distributed Key-Value Stores: “Structured data with lesser need for complex queries”. Consistent: BigTable, HBase, Voldemort. Eventually Consistent: Dynamo, Cassandra. Document and/or Blob Stores: “Indexed structured data + binaries/fulltext”. CouchDB, BerkeleyDB, MongoDB.
  • 39. Clustered DBMS for Transactions. Oracle Real Application Clusters (RAC): shared disk, replicated memory (“Cache Fusion”); limited by mesh interconnect to disk (partitioning possible). IBM DB2 Database Partitioning Feature: shared nothing database cluster, high number of nodes. IBM DB2 pureScale: new (Oct 2009) technology that ports IBM DB2 mainframe shared-disk clustering to DB2 for open systems. Microsoft SQL Server 2008: “federated” shared nothing database, a longtime feature.
  • 40. Clustered DBMS for Parallel Queries. Teradata: the old standard data warehouse, hardware + software. Netezza: data warehousing appliance (hw + software). Vertica: column-oriented, shared nothing clustered database; Mike Stonebraker’s new company. Greenplum: column-oriented, shared nothing clustered database; based on PostgreSQL with MapReduce engine.
  • 41. Scaling to Internet-Scale. Single Control Domain: One database site. Consistency is built-in. Scalable with tradeoffs among different workloads. Scale to the limits of network bandwidth & manageability. Main example: clustered DBMS. Multiple Control Domains: Many database sites. Consistency requires agreement protocol. Scalable only if consistency is relaxed. Nearly limitless (global) scale. Main examples: DNS, the Web.
  • 42. How do I make consistency tradeoffs? Theory interlude: The CAP theorem. Consistency (A+C in ACID): There’s a total ordering on all operations on the data; i.e. like a sequence. Availability: Every request on non-failed servers must have a response. Tolerance to Network Partitions: All messages might be lost between server nodes. Choose at most two of these (as a spectrum).
  • 43. CAP Tradeoffs: Consistency & Availability. The common case. Fault tolerance through replicas & fast fail + fast recovery. Implications: network outage between servers might halt the system; generally requires a single domain of control. Examples: Oracle RAC, IBM DB2 Parallel; clustered file systems; Coherence, Gigaspaces & Terracotta.
  • 55. CAP Tradeoffs: Consistency & Partitions. Common approach for traditional distributed systems. Implications: multiple domains of control; clients can’t always read/write; failures degrade scale & performance due to negotiation. Examples: distributed shared nothing databases; distributed locks & file systems; Chubby & Hadoop’s ZooKeeper; Paxos & consensus protocols; synchronous log shipping.
  • 66. CAP Tradeoffs: Partitions & Availability. New approach for Internet-scale systems. Implications: multiple domains of control; reads & writes always succeed (eventually); clients may read inconsistent (old or undone) data. Examples: Web caching & content delivery networks; Amazon Dynamo (and clones); Cassandra (Facebook, Digg); asynchronous log shipping.
  • 77. Summary of the CAP Tradeoffs. Mix & match the tradeoffs where appropriate: Google’s search engine uses all three! The tradeoffs are a spectrum, and are not static choices, e.g. there are adjustable levels of consistency to consider: strict, causal, snapshot / epoch, eventual, weak… The main tradeoff: writes to multiple sites / domains of control (with or without high availability): Single Domain (don’t tolerate network partitions), or Agreement Protocol (reduces availability), or Relaxed Consistency (stale/inaccurate data is possible). Weaker consistency is where the idea of a DBMS falters (it is contrary to its main purpose in life).
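One way to see the "agreement protocol vs. relaxed consistency" choice is a toy quorum store. This is a sketch under stated assumptions, not how Dynamo or Cassandra are implemented: `QuorumStore`, `Replica`, and `Unavailable` are hypothetical names, a single in-process counter stands in for versioning, and writes simply go to every reachable replica. With N replicas, requiring W acknowledgements per write and R replicas per read such that R + W > N forces read and write sets to overlap.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Too few reachable replicas to satisfy the configured quorum."""


@dataclass
class Replica:
    version: int = 0
    value: object = None
    reachable: bool = True


class QuorumStore:
    def __init__(self, n: int, w: int, r: int):
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.clock = 0  # assumed single coordinator; a real system has none

    def _live(self):
        return [rep for rep in self.replicas if rep.reachable]

    def write(self, value) -> bool:
        live = self._live()
        if len(live) < self.w:
            return False  # refuse the write: availability is given up
        self.clock += 1
        for rep in live:
            rep.version, rep.value = self.clock, value
        return True

    def read(self):
        live = self._live()
        if len(live) < self.r:
            raise Unavailable("not enough replicas for a read quorum")
        # Ask only R replicas and trust the highest version among them.
        return max(live[: self.r], key=lambda rep: rep.version).value


# Relaxed consistency (R + W <= N): always writable, but reads can be stale.
relaxed = QuorumStore(n=3, w=1, r=1)
relaxed.write("v1")
relaxed.replicas[1].reachable = relaxed.replicas[2].reachable = False
relaxed.write("v2")                      # only replica 0 sees v2
relaxed.replicas[0].reachable = False
relaxed.replicas[1].reachable = True
stale = relaxed.read()                   # replica 1 still holds v1

# Agreement (R + W > N): a partitioned minority cannot accept writes.
strict = QuorumStore(n=3, w=2, r=2)
strict.write("v1")
strict.replicas[1].reachable = strict.replicas[2].reachable = False
accepted = strict.write("v2")            # only 1 of 3 reachable
print(stale, accepted)
```

Changing `w` and `r` moves a system along the spectrum the summary describes, without changing the code that uses the store.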
  • 78. Please don’t throw out logical/relational data design! (unless you have to) “Future users of massive datasets should be protected from having to know how the data is organized in the computing cloud…. …. Activities of users through web agents and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” Paraphrasing Ed Codd – 39 years ago!
  • 80. The Dev / Ops Game
  • 81. Example: Why can’t these two servers communicate? Possible areas of problems. Security: bad credentials. Server Configuration: wrong IP or port; bad setup to listen or call. Network Configuration: wrong duplex; bad DNS or DHCP. Firewall Configuration: ports or protocols not open.
  • 82. Example: What do I need to do to make this change? Desired Change: Scale-out this cluster. But… Impacts on other systems: security systems, load balancers, monitoring, CMDB / service desk. Architecture issues: stateful or stateless nodes; repartitioning? Limits/constraints on scale out?
  • 83. Example: What is the authoritative reality? Desired State: configuration template, model, script, workflow, CMDB, code. Current State: on the server; might not be in a file; might get changed at runtime. And when you do change… it may not actually change; it might change to an undesirable setting; it might affect other settings that you didn’t think about.
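The gap between desired and current state can be sketched as a comparison plus a read-back check after each change. The setting names, the `UNSET` marker, and the simulated server that silently ignores one setting are all assumptions made for illustration.

```python
UNSET = "<unset>"  # hypothetical marker for a setting absent on one side


def config_drift(desired: dict, current: dict) -> dict:
    """Return {setting: (desired_value, current_value)} where the two differ."""
    drift = {}
    for key in sorted(desired.keys() | current.keys()):
        want = desired.get(key, UNSET)
        have = current.get(key, UNSET)
        if want != have:
            drift[key] = (want, have)
    return drift


def apply_and_verify(key, value, write_setting, read_setting) -> bool:
    """Apply a change, then read it back instead of assuming it took effect."""
    write_setting(key, value)
    return read_setting(key) == value


desired = {"port": 8080, "duplex": "full"}
server = {"port": 8080, "duplex": "half", "debug": True}  # simulated live state


def write_setting(key, value):
    # Simulate a server that silently refuses to change its duplex setting.
    if key != "duplex":
        server[key] = value


drift = config_drift(desired, server)
duplex_fixed = apply_and_verify("duplex", "full", write_setting, server.get)
print(drift, duplex_fixed)
```

The `debug` entry shows a runtime setting that exists only on the server, and the failed read-back shows a change that "may not actually change", so neither the template nor the server alone is the authoritative reality.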
  • 84. Configuration Code, Files, and Models. Bottom Up: Scripts & Recipes (hand-grown automation); Runbooks (workflow, policy); Frameworks (Chef, Puppet, Cfengine); Build Dependency Systems (Maven). Top Down: Modeled Viewpoints (e.g. Microsoft Oslo, UML, Enterprise Architecture); Modular Containers (e.g. OSGi, Spring, Azure roles); Configuration Models (SML, CIM, ECML, EDML).
  • 85. An Evolution of Automation. Scripts: for automating common cases. Run-Book Automation: scripts as visual workflow. Declarative: separate what you want from how you want it done. Finite State Machines: organize scripts into described states & transitions. Hierarchical Task Networks (Planning): assemble a plan by exploring hypothetical strategic paths.
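The finite-state-machine step can be sketched as a transition table that decides which script may run in which state. The state names, event names, and the `DeploymentFSM` class are hypothetical; the "scripts" here are stand-in Python callables.

```python
# (current state, event) -> next state; anything not listed is rejected.
TRANSITIONS = {
    ("stopped", "provision"): "provisioned",
    ("provisioned", "configure"): "configured",
    ("configured", "start"): "running",
    ("running", "stop"): "stopped",
}


class DeploymentFSM:
    def __init__(self, scripts: dict, initial: str = "stopped"):
        self.state = initial
        self.scripts = scripts  # event name -> callable standing in for a script
        self.history = [initial]

    def fire(self, event: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot '{event}' while '{self.state}'")
        self.scripts[event]()          # run the script attached to the transition
        self.state = TRANSITIONS[key]  # move only if the script did not raise
        self.history.append(self.state)
        return self.state


ran = []
scripts = {
    name: (lambda n=name: ran.append(n))
    for name in ("provision", "configure", "start", "stop")
}

node = DeploymentFSM(scripts)
for event in ("provision", "configure", "start"):
    node.fire(event)
print(node.state, ran)
```

Compared with a bare script, the table makes illegal orderings (for example, starting a node that was never configured) fail loudly before any script runs.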
  • 86. An Approach to Integrated Design and Ops
  • 87. Wrap-up: Cloudy, with a chance of …
  • 88. Revisiting the Cloud Design Reference Architecture. Service: Big vs. Little Problems; MapReduce & Actors; Amdahl’s Law. Data: persistence vs. management; scale-up vs. scale-out; CAP tradeoffs. Control: containers, configuration, automation. Layers: Web, Service, Data, Control, Resource.
  • 89. For More Information. Hadoop (http://hadoop.apache.org/). CAP Theorem Proof Paper (http://people.csail.mit.edu/sethg/pubs/BrewersConjecture-SigAct.pdf). Google’s papers on Distributed & Parallel Computing (http://research.google.com/pubs/DistributedSystemsandParallelComputing.html). Neil Gunther’s “Taking the Pith out of Performance” Blog (http://perfdynamics.blogspot.com/). A Comparison of Approaches to Large-Scale Data Analytics (http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf). Model-Driven Operations for the Cloud (http://www.stucharlton.com/stuff/oopsla09.pdf).