Architecting for the Cloud: Cloud Providers

© Matthew Bass 2013
Architecting for the Cloud
Len and Matt Bass
Cloud Providers

IaaS Providers
• There are several primary providers
– Amazon: Amazon Web Services (AWS)
– Microsoft: Azure
– Google: Google Compute Engine
– …
• Each of these is set up a bit differently, with slightly different
internal decisions and associated services

Goals
• The goal of this talk is not to give you a definitive how-to for
each provider
• It’s meant to give you just an introduction
• The idea is that you’ll see how the concepts that we talked
about in the course map to specific providers
• We’ll look primarily at Amazon (with some details from others
thrown in)
• We’ll go through both the overall structure and look at specific
services

Amazon Elastic Compute Cloud
• Amazon EC2 provides compute capacity in the cloud
• You can select the machine image with a given OS and specified
capability
• You can resize the capacity as needed
• Takes minutes to spin up a new VM
• You can specify multiple instances and select where they will run
– Region & availability zones
• You pay per hour of usage, depending on the capability of the instance
and whether it’s a reserved instance (dedicated)

Regions
• Amazon has divided their cloud offerings into multiple regions. Each region
should be thought of as a separate cloud
– I.e. there is no automatic copying of data from one region to another.

Current AWS Regions
• North America:
– US East (5 availability zones)
– US West Oregon (3 availability zones)
– US West Northern California (3 availability zones)
– USGov Cloud (2 availability zones)
• South America
– Sao Paulo (2 availability zones)
• Europe
– Ireland (2 availability zones)
• Asia Pacific
– Sydney (2 availability zones)
– Singapore (2 availability zones)
– China (1 availability zone)
– Tokyo (3 availability zones)

AWS and Services
• Amazon Web Services offers a number of services
• These services are things like:
– Storage
– Database
– Network capabilities
– Monitoring
– …
• Not all services are available in all regions
– https://aws.amazon.com/about-aws/globalinfrastructure/regional-product-services/

Amazon Availability Zones
• Amazon has a notion of availability zones
• Engineered to be insulated from failures in other availability zones
• Availability zones are locations within a region
• Amazon has not announced the details of its availability zones, but presumably they:
– Are physically separate data centers
– Have independent networks
– Have independent power delivery
– …

Amazon Service Level Agreement
• Amazon guarantees 99.95% availability for each region
• IaaS consumers are free to deploy their applications:
– Within an availability zone
– Across availability zones but within a region
– Across regions
• Amazon does not make any claim about the availability of their availability zones
(that I could find)
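To make the 99.95% figure concrete, here is a minimal sketch of the arithmetic. The second function assumes that redundant deployments fail independently; that is our simplifying assumption, not a claim Amazon makes, and real failures can be correlated.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a service may be down and still meet `availability`."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, ASSUMING independent failures."""
    return 1.0 - (1.0 - availability) ** copies


print(f"{allowed_downtime_hours(0.9995):.2f} hours of downtime per year")
print(f"{combined_availability(0.9995, 2):.8f} with two independent deployments")
```

Under the independence assumption, each additional deployment multiplies the probability of a total outage by the single-deployment failure probability.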

All-in-one Single Server

Basic 4-server Setup

Multiple Availability Zones

Multiple Regions

Elastic Compute Cloud (EC2) & Redundancy
• EC2 supports different levels of redundancy
– It is up to the customer to determine how much redundancy they
wish to have and how much they wish to pay for it
• Redundant elements can be:
– Within an availability zone
– Across availability zones
– Across regions

Microsoft Azure Regions
• North America
– US Central (Iowa)
– US East (Virginia)
– US East 2 (Virginia)
– US North Central (Illinois)
– US South Central (Texas)
– US West (California)
• Europe
– Europe North (Ireland)
– Europe West (Netherlands)
• Asia Pacific
– East (Hong Kong)
– Southeast (Singapore)
• Japan
– Japan East (Saitama)
– Japan West (Osaka)
• Brazil
– Sao Paulo

Fault Domains in Azure
• In Azure there is the concept of Fault Domains
• A Fault Domain is essentially a rack in a given datacenter
• A consumer is not able to define which fault domains the
application is distributed to
– Unlike an availability zone
• As a result the fault domain is really an internal structure

Upgrade Domains in Azure
• An upgrade domain is similar to a fault domain
• Essentially, everything in an upgrade domain is upgraded at one time
– When Microsoft upgrades their internal infrastructure they do so a
domain at a time
• In order to guard against both failures within a fault domain and
upgrades, you need to replicate across both fault and upgrade
domains
• This is called an availability set
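A rough sketch of the idea follows. It mimics the kind of placement the platform performs internally (the consumer does not choose the domains); the round-robin rule and the domain counts of 2 and 5 are hypothetical values chosen for illustration.

```python
from collections import Counter


def place_replicas(replicas: int, fault_domains: int, upgrade_domains: int):
    """Assign each replica a (fault_domain, upgrade_domain) pair round-robin."""
    return [(i % fault_domains, i % upgrade_domains) for i in range(replicas)]


def worst_case_survivors(placement):
    """Replicas still running after losing the most heavily used single domain."""
    fault_load = Counter(fd for fd, _ in placement)
    upgrade_load = Counter(ud for _, ud in placement)
    worst_loss = max(max(fault_load.values()), max(upgrade_load.values()))
    return len(placement) - worst_loss


placement = place_replicas(replicas=4, fault_domains=2, upgrade_domains=5)
print(placement)
print(worst_case_survivors(placement))
```

With a single replica the worst case leaves nothing running, which is why replication across domains is needed at all.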

Azure Availability Sets

Amazon Auto Scaling
• Auto Scaling works in conjunction with CloudWatch (Amazon’s monitoring
service)
• The idea is the monitoring service monitors the metrics
– CPU utilization
– Latency
– Memory consumption
• The Auto Scaling solution establishes the rules
– Add instances when utilization exceeds 70%
– Remove instances when utilization falls below 10%
• You can specify things like a “cooling off” period
– Where no action is taken until the system has a chance to stabilize
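The rules above can be sketched as a small decision function. This is not the Auto Scaling API; the 70% and 10% thresholds come from the slide, while the 300-second cooling-off period, the minimum instance count, and the sample metric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_above: float = 70.0   # percent utilization, from the rule above
    scale_in_below: float = 10.0    # percent utilization, from the rule above
    cooldown_seconds: int = 300     # hypothetical cooling-off period
    min_instances: int = 1          # hypothetical floor


def decide(policy, utilization, instances, now, last_action_at):
    """Return (new_instance_count, time_of_last_action) for one metric sample."""
    if last_action_at is not None and now - last_action_at < policy.cooldown_seconds:
        return instances, last_action_at  # still cooling off: take no action
    if utilization > policy.scale_out_above:
        return instances + 1, now
    if utilization < policy.scale_in_below and instances > policy.min_instances:
        return instances - 1, now
    return instances, last_action_at


policy = ScalingPolicy()
count, last = 2, None
for t, cpu in [(0, 85.0), (60, 90.0), (400, 5.0), (800, 40.0)]:
    count, last = decide(policy, cpu, count, t, last)
    print(t, cpu, count)
```

The second sample (90% at t=60) triggers nothing because the system is still inside the cooling-off window from the first scale-out.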

Amazon Elastic Load Balancer
• This is Amazon’s load balancing solution
– Recall the push/pull architecture discussion
• It tracks the status and location of instances
• Routes requests to healthy instances based on criteria that you establish
• Can be used in conjunction with Auto Scaling
– When new instances are added or removed they are registered with the ELB
• Can be used in conjunction with Amazon’s DNS service (Route 53)
– You can use DNS failover to move from one region to another
– The DNS will route traffic to the ELB in the target region
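As an illustration of routing only to healthy instances, here is a toy balancer. It is a sketch of the concept, not the ELB API; the round-robin criterion and the instance ids are assumptions made for the example.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips instances marked unhealthy."""

    def __init__(self):
        self.healthy = {}  # instance id -> bool, in registration order

    def register(self, instance_id):
        self.healthy[instance_id] = True

    def deregister(self, instance_id):
        self.healthy.pop(instance_id, None)

    def mark(self, instance_id, is_healthy):
        self.healthy[instance_id] = is_healthy

    def route(self, request_count):
        """Return the instance chosen for each of `request_count` requests."""
        targets = [i for i, ok in self.healthy.items() if ok]
        if not targets:
            raise RuntimeError("no healthy instances")
        rotation = cycle(targets)
        return [next(rotation) for _ in range(request_count)]


lb = LoadBalancer()
for instance in ("i-a", "i-b", "i-c"):   # hypothetical instance ids
    lb.register(instance)
lb.mark("i-b", False)                     # a failed health check
print(lb.route(4))
```

Registering and deregistering are the hooks an auto scaler would call as instances come and go.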

Amazon Simple Queue Service
• SQS is Amazon’s queuing service
– Again recall the push/pull architecture discussion
• It’s a service that supports message queues
• Recall it can be used in conjunction with Auto Scaling to
manage the elasticity of your application
• Pricing is per million requests handled
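One way a queue can drive elasticity is to size the worker pool from the backlog. The sketch below uses an in-process deque as a stand-in for the queue service; the per-worker throughput and the worker cap are hypothetical numbers.

```python
import math
from collections import deque


def workers_needed(queue_depth, messages_per_worker, max_workers):
    """Workers required to drain the backlog in one interval, capped at max_workers."""
    if queue_depth == 0:
        return 0
    return min(max_workers, math.ceil(queue_depth / messages_per_worker))


queue = deque(f"job-{n}" for n in range(25))   # producers push messages
print(workers_needed(len(queue), messages_per_worker=10, max_workers=5))

# Workers pull at their own pace; producers never address a worker directly.
processed = []
while queue:
    processed.append(queue.popleft())
print(len(processed), workers_needed(len(queue), 10, 5))
```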

Amazon Storage Solutions
• Amazon has several storage solutions
– Elastic Block Store (EBS)
– Simple Storage Service (S3)
– Glacier
• These provide raw unmanaged storage
• This is useful for:
– Disaster recovery
– Backup
– Archiving
– Persistence for your own database solution

Amazon Elastic Block Store
Amazon Elastic Block Store (EBS) is Amazon’s block-level storage service.
Some of its features are:
• Data is persisted independently from instances
• EBS data is placed in a specific availability zone and can be attached to instances in
the same availability zone
• EBS data is automatically replicated within its availability zone
• There are two networks that connect EBS instances
– A high speed network to provide coordination among instances and move data between
instances.
– A lower speed network used as backup for coordination.
• $0.05 per million I/O requests

Amazon Simple Storage Service (S3)
• S3 is a scalable storage solution
• Good for content storage and distribution
• Good for backup, archiving, and disaster recovery
• Costs $0.03 per GB of data
• More expensive but faster than Glacier
• Not as fast for I/O as EBS

Amazon Glacier
• Low cost storage solution
• Good for off-site archival of enterprise data
• Good for backup and data archiving
• Good for large volumes of data
• Costs $0.01 per GB of data
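The prices quoted on the EBS, S3, and Glacier slides can be compared with a little arithmetic. The slides do not state the billing period for the per-GB prices; the sketch assumes it is monthly, and the archive size and request count are hypothetical.

```python
S3_PER_GB = 0.03            # USD per GB, from the S3 slide
GLACIER_PER_GB = 0.01       # USD per GB, from the Glacier slide
EBS_PER_MILLION_IO = 0.05   # USD per million I/O requests, from the EBS slide


def storage_cost(gigabytes, price_per_gb):
    """Cost of holding `gigabytes` for one billing period (assumed monthly)."""
    return gigabytes * price_per_gb


def ebs_io_cost(io_requests):
    """Request charge for an EBS volume."""
    return io_requests / 1_000_000 * EBS_PER_MILLION_IO


archive_gb = 5_000  # hypothetical archive size
print(f"S3:      ${storage_cost(archive_gb, S3_PER_GB):.2f}")
print(f"Glacier: ${storage_cost(archive_gb, GLACIER_PER_GB):.2f}")
print(f"EBS I/O: ${ebs_io_cost(40_000_000):.2f}")
```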

Amazon Database Solutions
• Amazon has a number of fully managed database solutions
• These are built on top of one of Amazon’s storage solutions
• They include:
– DynamoDB
– Relational Database Service (RDS)
– Redshift
– ElastiCache

DynamoDB
• Key Value data store
• Uses a throughput oriented pricing model (rather than a
storage oriented model)
• Uses solid state drives
• Guarantees single-digit millisecond read latencies
• You pay a flat hourly rate based on capacity that you reserve
– Costs $0.0065 per hour for every 10 units of write capacity
– Costs $0.0065 per hour for every 10 units of read capacity
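A small sketch of the throughput-oriented pricing, using the rates as quoted on this slide. Rounding partial blocks of 10 units up to a whole block is our assumption, and the capacity figures in the example are hypothetical.

```python
import math

PRICE_PER_10_UNITS_PER_HOUR = 0.0065  # USD, as quoted for both reads and writes


def hourly_cost(read_units, write_units):
    """Hourly charge for reserved capacity, billed in blocks of 10 units.

    Rounding partial blocks up is an assumption made for this sketch.
    """
    read_blocks = math.ceil(read_units / 10)
    write_blocks = math.ceil(write_units / 10)
    return (read_blocks + write_blocks) * PRICE_PER_10_UNITS_PER_HOUR


cost = hourly_cost(read_units=100, write_units=50)
print(f"${cost:.4f} per hour, ${cost * 24 * 30:.2f} per 30-day month")
```

Note that the charge depends only on reserved capacity, not on how much data is stored.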

Relational Database Service
• A web service that provides a
relational database for use in applications
• It provides access to MySQL, Oracle, SQL Server, or
PostgreSQL
• It simplifies installation, patching, and backup related
issues
• Priced per hour according to database type, size, and number of instances

Redshift
• Redshift is Amazon’s data warehousing solution
• Integrates with other storage solutions
• Priced at either $0.25 per hour on the low end
or $1,000 per terabyte per year

ElastiCache
• A web service that provides an in-memory data cache
• Supports:
– Memcached
– Redis
• Improves latency and throughput for read heavy applications
• Prices are per Cache node/hour
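Why a cache helps read-heavy applications can be seen in a minimal cache-aside sketch. A plain dictionary stands in for a Memcached or Redis node, and the backing store and keys are hypothetical; this is not the ElastiCache API.

```python
class CacheAside:
    """Cache-aside helper: check the in-memory cache first, then the database."""

    def __init__(self, load_from_db):
        self.load_from_db = load_from_db
        self.cache = {}       # stand-in for a Memcached or Redis node
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.load_from_db(key)   # slow path: go to the database
        self.cache[key] = value          # populate the cache for later reads
        return value


database = {"user:1": "Ada", "user:2": "Grace"}  # hypothetical backing store
store = CacheAside(database.__getitem__)
for key in ["user:1", "user:1", "user:2", "user:1"]:
    store.get(key)
print(store.hits, store.misses)
```

Only the first read of each key reaches the database; repeated reads are served from memory.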

Amazon CloudFront
• Amazon’s content delivery network
• Provides edge services
– Competes with companies such as Akamai
• This service will allow you to locate content closer to users
– Reduces latency
• You specify the edge location and point it to the origin
• You can route DNS to the edge location if you want

Amazon Elastic IP Addressing
• Amazon provides elastic IP addressing
• The IP address is associated with your account – not with an
instance
• You can programmatically map the elastic IP to any instance in
your account
• In this way you make the deployment configuration
transparent to the user/application
– Remember the virtual network discussion?
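The remapping idea can be sketched with a simple lookup table. This models the concept only and is not the EC2 API; the address comes from a documentation-reserved range and the instance ids are hypothetical.

```python
class ElasticIpTable:
    """Maps account-owned addresses to whichever instance currently serves them."""

    def __init__(self):
        self.bindings = {}  # address -> instance id

    def associate(self, address, instance_id):
        """Point `address` at `instance_id`; return the previous instance, if any."""
        previous = self.bindings.get(address)
        self.bindings[address] = instance_id
        return previous

    def resolve(self, address):
        return self.bindings[address]


table = ElasticIpTable()
table.associate("203.0.113.10", "i-primary")
replaced = table.associate("203.0.113.10", "i-standby")  # fail over
print(replaced, "->", table.resolve("203.0.113.10"))
```

Clients keep using the same address throughout, so the change of instance is invisible to them.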

Many Other Services Available
• Authentication services
• Analytics
• Elastic Map Reduce
• Real time data streaming and processing
• Business process automation services
• Email services
• Notification services
• …

Comparison to Other Providers
• Other major providers (Google, Microsoft, Rackspace) offer
similar services
• Google doesn’t have as many services but has a different pricing
model
– Charges in 10-minute increments rather than one-hour increments
• Microsoft has similar services
• Rackspace also provides comparable options
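The effect of the billing increment is easy to see with a short calculation. The sketch assumes usage is rounded up to the next whole increment; the 65-minute runtime is a hypothetical example.

```python
import math


def billed_minutes(runtime_minutes, increment_minutes):
    """Round usage up to the next whole billing increment (assumed rounding rule)."""
    return math.ceil(runtime_minutes / increment_minutes) * increment_minutes


runtime = 65  # hypothetical job length in minutes
print(billed_minutes(runtime, 60))  # one-hour increments
print(billed_minutes(runtime, 10))  # 10-minute increments
```

Finer increments matter most for short-lived or bursty workloads, where the rounded-up remainder is a large share of the bill.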

Outages
• In Amazon (and others) there are some kinds of outages that
are specific to the structure of the provider
• We will now look at some of these outages

Zone Failure
• All of the IaaS providers have some notion of an “availability zone”
• An availability zone (or fault domain in Azure) has its own switch,
router, and rack
• These availability zones are isolated from each other in a way that
nodes within an availability zone are not

Zone Failure Modes
• A zone can fail in different ways
(Figure: a region containing Zone 1, Zone 2, and Zone 3)

Complete Failure
• If for example you have a power outage you’ll have a complete
failure
• If you try to route traffic to any of these machines you’ll get a “no
route to host”
– This happens quickly – fast fail
• You’ll know the zone is out
• You can then spin up a new zone elsewhere

Zone Failure Modes
• You could have a network failure

Network Failure
• If you have a network failure it’s typically not a complete failure
• The machines are still working but the network is having trouble
• There is often still a route to host but your data isn’t reaching the
host
• As a result you don’t get a fast fail
– You’ll get long timeouts

Network Failure
• With the long timeouts your system will start to back up
• It’s difficult to tell the difference between this issue and other
issues that result in latency lags
• This problem can be intermittent as some of the routers might be
down but not all
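The difference between a fast fail and a slow, partial network failure can be sketched as a classifier over health-check results. The probe format, the 5-second timeout, and the category labels are all assumptions made for this example.

```python
def classify_zone(probes, timeout_seconds=5.0):
    """Classify a zone from (succeeded, elapsed_seconds) health-check results."""
    if not probes:
        raise ValueError("at least one probe result is required")
    if all(ok for ok, _ in probes):
        return "healthy"
    failures = [elapsed for ok, elapsed in probes if not ok]
    # Every probe failed, and failed quickly: looks like "no route to host".
    if len(failures) == len(probes) and all(e < timeout_seconds for e in failures):
        return "complete failure (fast fail)"
    # Some probes succeed or some hang until the timeout: harder to diagnose.
    return "possible network failure (timeouts or intermittent errors)"


print(classify_zone([(True, 0.02), (True, 0.03)]))
print(classify_zone([(False, 0.01), (False, 0.02)]))
print(classify_zone([(True, 0.02), (False, 5.0), (False, 5.0)]))
```

The third case is the troublesome one: each failed probe costs a full timeout, so detection is slow and the result is ambiguous.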

Zone Failure Modes
• You could have a failure of some zone service

Zone Service Failure
• This is when a service that the zone depends on fails
– It could be something that is part of the platform as a service (e.g.
EBS)
– It could also be a central service in your application
• This causes cascading failures
• Difficult to figure out what is going on

Region Failure
• It’s rare but a Region can fail as well
• Both complete and partial failures have happened
• Typically this starts with isolated issues that cascade
• There might be an issue with a few nodes or with a single availability zone
• Other zones become impacted (often due to additional traffic) and fail
– It can be difficult to determine the scope of the issue while it’s occurring
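How a single zone failure can cascade through additional traffic is shown in the simplified model below. It assumes load is spread evenly over the surviving zones and that an overloaded zone fails outright; the load and capacity numbers are hypothetical.

```python
def simulate_cascade(total_load, zone_capacity, zones, first_failure=1):
    """Return how many zones remain once load redistribution settles."""
    alive = zones - first_failure
    while alive > 0 and total_load / alive > zone_capacity:
        alive -= 1  # an overloaded zone fails and sheds its share onto the rest
    return alive


# Three zones each carry 80 units; losing one pushes the others to 120 each.
print(simulate_cascade(total_load=240, zone_capacity=100, zones=3))
print(simulate_cascade(total_load=240, zone_capacity=130, zones=3))
```

With headroom of only 100 units per zone the whole region goes down, while 130 units per zone absorbs the failure.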

Regional Failure Modes
• You could lose network access to a region

Regional Outage
• This is often caused by:
– A DNS issue
– Router issues
– Network capacity overload
• Causes you to lose access to a region

Regional Failure Modes
• Local failures can cause a control plane overload

Data Store Failure
• As with the other portions of the system the data store can become
unresponsive
• The remedy for this is typically to mark this node as bad and attempt to
bring a new node online
• If the issue is more pervasive it can result in:
– Disrupted availability
– Loss of persistent data

Backup Failure
• Systems will often have a backup data mechanism
• This is often a key component in disaster recovery
• This can also fail
– It can become temporarily or permanently unavailable

Upgrades
• Cloud providers need to upgrade their software as well
• When they do this the nodes that are being upgraded
experience an outage
• If your software is running on these nodes you might
experience an outage as well

Utilizing AWS
• You can utilize AWS in many ways
– You can host your entire application in the cloud
– You can host a specific portion of your application in the cloud
– You can use the cloud for a specialized need

Hosting Your Application
• You can have a system that is fully deployed in the cloud
• You’ll need to figure out how to structure the application to achieve both functional and quality
attribute needs
• You’ll want to first consider quality attribute concerns such as:
– Scalability
– Availability
– Security
– …
• Utilize the techniques we talked about to determine the needs
– Fault modeling (considering the cloud specific faults)
– Threat modeling
– Understanding the anticipated load and desired throughput and latency
• Come up with a gross structure that achieves your objectives
– Think about partitioning of the system to support testing, degraded modes of operation and independent
deployment

Partial Hosting
• You might want to leverage the cloud for a specific portion of your
system e.g.
– Supporting mobile applications
– Databases
– Analytics
– Delivery of particular content
– Hosting your front end
– …
• This is typically going to be driven by cost and quality attribute
needs (e.g. scalability)

Backup and Recovery
• Many organizations utilize the cloud for bulk storage, archiving,
or back up and recovery
• In the past external services were used for such needs
– They often stored data on tape in separate physical locations
• It can be cheaper and more convenient to utilize cloud services
• As a result many organizations use the cloud for such storage
needs

Summary
• Many services are available in the cloud
– Storage
– Network
– Compute related services
– …
• These services provide different levels of service at different pricing
levels
• Utilizing the cloud appropriately and efficiently takes an explicit
understanding of both your needs and the services available
