Google Warehouse-Scale Computer: Hardware and Performance
Prepared by: Tejhaskar Ashok Kumar
Master of Applied Science in Computer
Engineering
Memorial University of Newfoundland
Outline
 Introduction
 Architectural Overview of Google WSC
 Server
 Storage
 Network
 Hardware Accelerators
 GPU
 TPU
 Energy and Power Efficiency
 Performance of Google WSC
 Top-Down Micro-architectural Analysis Method
 Cooling Systems
Introduction
 A Warehouse-Scale Computer (WSC) comprises tens to thousands of clustered servers connected by a network to process thousands to millions of user requests.
 A WSC can be used to provide internet services such as search, video sharing, and e-commerce.
 The difference between a traditional data center and a WSC:
 Data centers host services for multiple providers
 A WSC is run by a single organization
 WSCs exploit Request-Level Parallelism and Data-Level Parallelism
 WSC Design Goals:
 Cost-Performance
 Energy Efficiency
 Dependability via Redundancy
 Network I/O
 Interactive and Batch processing workloads
Architectural Overview of Google WSC
 In general, a WSC is a building that can be viewed as a single computer, in which multiple server nodes and storage are tightly coupled by interconnection networks.
 Inside the building there are many containers, and each container holds multiple servers arranged in racks and interconnected with storage.
 Each container also has cooling support to remove the heat generated.
 A Google WSC uses more than 100,000 servers.
 WSCs are designed to serve every request for internet services reliably and quickly.
Fig 1 – WSC as a building
Servers
 The WSC uses low-end servers in a 1U or blade-enclosure format; the servers are mounted in racks, and each server is connected to an Ethernet switch.
 Google WSC uses 19-inch-wide racks, each holding 48 blade servers connected to a rack switch.
 Each server has a PCIe link, which is used to connect the CPU servers to the GPU and TPU trays.
 Every server is arranged in a rack that is 7 ft high, 4 ft wide, and 2 ft deep, and contains about 48 slots for loading servers, power conversion units, network switches, and a battery backup tray that is used during power interruptions.
 CPUs used in Google WSC: Intel Xeon Scalable (Cascade Lake), Intel Xeon Scalable (Skylake), Intel Xeon E7 (Broadwell E7), Intel Xeon E5 v4 (Broadwell E5), Intel Xeon E5 v3 (Haswell), Intel Xeon E5 v2 (Ivy Bridge), Intel Xeon E5 (Sandy Bridge), AMD EPYC Rome
Fig 2 – Server to Cluster
Fig 3 – Server Rack
Storage
 Google WSC relies on local disks and uses the Google File System (GFS), a distributed file system developed by Google.
 A GFS cluster contains multiple nodes, divided into two categories (see the lookup sketch after this slide):
 Master node: maintains the file-system metadata used to locate the data for the current request
 Chunkserver: stores the data chunks and maintains the copies of the data
 Google WSC storage keeps at least three replicas to improve dependability.
 Every server's storage is interconnected with the other server storage in its local rack, and the storage of every local rack is interconnected at the cluster level.
Fig 4 – Storage Hierarchy of Google WSC
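To make the master/chunkserver split concrete, here is a minimal Python sketch of a GFS-style lookup. The master holds only metadata (chunk handles and replica locations) while chunkservers hold the bytes; all class names, paths, and server names are hypothetical, not Google's actual API.

```python
# Minimal GFS-style lookup sketch (illustrative names, not Google's API).
REPLICATION_FACTOR = 3  # GFS keeps at least three replicas of each chunk

class Master:
    """Holds metadata only: file -> chunk handles -> replica locations."""
    def __init__(self):
        self.file_chunks = {"/logs/web-00": ["chunk-17", "chunk-18"]}
        self.chunk_replicas = {
            "chunk-17": ["rack1-srv04", "rack2-srv11", "rack5-srv23"],
            "chunk-18": ["rack1-srv09", "rack3-srv02", "rack4-srv17"],
        }

    def lookup(self, path: str) -> list[tuple[str, list[str]]]:
        # Return (chunk handle, replica servers); data never flows through the master.
        return [(h, self.chunk_replicas[h]) for h in self.file_chunks[path]]

class Chunkserver:
    """Holds the actual chunk bytes; clients read from the nearest replica."""
    def __init__(self, name: str):
        self.name = name
        self.chunks: dict[str, bytes] = {}

master = Master()
for handle, replicas in master.lookup("/logs/web-00"):
    assert len(replicas) >= REPLICATION_FACTOR
    print(f"{handle}: read from nearest of {replicas}")
```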
Network
 The Google WSC network is a Clos network, a multistage network built from low-port-count switches.
 Clos networks are fault-tolerant and provide excellent bandwidth; Google improves the bandwidth of the network by adding more stages to the multistage network.
 Google uses a 3-stage Clos network (sketched below):
 Ingress stage (input stage)
 Middle stage
 Egress stage (output stage)
 Google uses the Jupiter Clos network inside its WSCs.
Fig 5 – 3-stage Clos network
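To illustrate why adding middle-stage switches adds bandwidth and fault tolerance, here is a small Python sketch that wires up a 3-stage Clos topology and counts the disjoint paths between an ingress and an egress switch; the switch counts are invented for illustration.

```python
# Hypothetical 3-stage Clos sketch: every ingress switch connects to every
# middle-stage switch, and every middle-stage switch to every egress switch.
from itertools import product

def build_clos(num_ingress: int, num_middle: int, num_egress: int) -> set:
    links = set()
    for i, m in product(range(num_ingress), range(num_middle)):
        links.add((f"in{i}", f"mid{m}"))   # ingress -> middle stage
    for m, e in product(range(num_middle), range(num_egress)):
        links.add((f"mid{m}", f"out{e}"))  # middle -> egress stage
    return links

NUM_MIDDLE = 3
links = build_clos(num_ingress=4, num_middle=NUM_MIDDLE, num_egress=4)

# Every middle switch reachable from in0 and reaching out2 is a disjoint path.
paths = [m for m in range(NUM_MIDDLE)
         if ("in0", f"mid{m}") in links and (f"mid{m}", "out2") in links]
print(f"{len(paths)} disjoint paths from in0 to out2")  # one per middle switch
# Adding middle-stage switches adds disjoint paths, which is where the extra
# bandwidth and fault tolerance come from.
```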
Hardware Accelerators
 If the overall performance of a uniprocessor is too slow, additional hardware can be used to speed up the system; this hardware is called a hardware accelerator.
 A hardware accelerator is a component that works alongside the processor and executes its tasks much faster than the processor can.
 An accelerator appears as a device on the bus.
 Google WSC uses:
 Graphical Processing Unit (GPU)
 Tensor Processing Unit (TPU)
Fig 6 – Hardware Accelerators
Graphical Processing Unit (GPU)
 In a Google WSC, each server CPU is connected to a PCIe-attached accelerator tray containing multiple GPUs.
 The GPUs within a tray are interconnected with NVLink, a wire-based, short-range communication protocol developed by Nvidia.
 Each streaming multiprocessor (SM) has an L1 cache associated with its cores, and the SMs share an L2 cache.
 The presence of many CUDA cores makes computation on an Nvidia GPU faster than on a CPU.
 In a GPU, the task to be executed is divided into several processes and distributed across several Processor Clusters (PCs) to achieve low memory latency and high throughput.
 The GPU has smaller cache layers than a CPU, since the GPU dedicates more of its transistors to computation.
 And since there are many cores, parallelism is achieved by running the processes effectively, quickly, and reliably.
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4
Fig 7 – Graphical Processing Unit
Graphical Processing Unit (GPU)
 For compute workloads, Google WSC uses the following GPUs, designed specifically for AI and data-center solutions (a peak-TFLOPS estimate appears below):
 NVIDIA® Tesla® T4
 NVIDIA® Tesla® V100
 NVIDIA® Tesla® P100
 NVIDIA® Tesla® P4
 NVIDIA® Tesla® K80
| GPU | No. of Cores | Memory Size | Memory Type | SM Count | Tensor Cores | L1 Cache | L2 Cache | FP32 (float) Performance |
|---|---|---|---|---|---|---|---|---|
| Tesla T4 | 2560 | 16 GB | GDDR6 | 40 | 320 | 64 KB/SM | 4 MB | 8.141 TFLOPS |
| Tesla V100 | 5120 | 32 GB | HBM2 | 80 | 640 | 128 KB/SM | 6 MB | 14.13 TFLOPS |
| Tesla P100 | 3584 | 16 GB | HBM2 | 56 | NA | 24 KB/SM | 4 MB | 9.526 TFLOPS |
| Tesla P4 | 2560 | 8 GB | GDDR5 | 20 | NA | 48 KB/SM | 2 MB | 5.704 TFLOPS |
| Tesla K80 | 4992 | 24 GB | GDDR5 | 26 | NA | 32 KB/SM | 1536 KB | 8.226 TFLOPS |
Chart – FP32 performance (TFLOPS) of the five GPUs
https://cloud.google.com/compute/docs/gpus
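As a sanity check on the FP32 column, peak FP32 throughput can be estimated as cores × boost clock × 2 FLOPs (one fused multiply-add per CUDA core per cycle). A minimal sketch, where the boost clocks are assumptions taken from public spec sheets:

```python
# Rough peak FP32 estimate: cores x boost clock (GHz) x 2 FLOPs (FMA) per cycle.
# Boost clocks are assumptions from public spec sheets, not measured values.
gpus = {
    "Tesla T4":   (2560, 1.590),
    "Tesla V100": (5120, 1.380),
    "Tesla P100": (3584, 1.329),
}

for name, (cores, boost_ghz) in gpus.items():
    tflops = cores * boost_ghz * 2 / 1000.0
    print(f"{name}: ~{tflops:.2f} TFLOPS FP32")
# ~8.14, ~14.13, and ~9.53 TFLOPS -- in line with the table above.
```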
Tensor Processing Unit (TPU)
 Google’s ASIC, designed specifically for AI solutions
 Matrix Multiply Unit (MXU) – the heart of the TPU
 Contains a 256×256 MAC array
 The Weight FIFO uses 8 GB of off-chip DRAM to provide weights to the MXU
 The Unified Buffer (24 MB) keeps the activation inputs/outputs of the MXU and the host
 Accumulators (4 MiB) collect the 16-bit products of the MXU
Fig 8 – Inside the TPU
Parallel Processing on the Matrix Multiply Unit (MXU)
 A typical RISC processor processes a single operation (scalar processing) with each instruction.
 A GPU uses vector processing and performs operations concurrently on multiple SMs; a GPU performs hundreds to thousands of operations in a single clock cycle.
 To increase the number of operations per clock cycle further, Google developed a matrix processor that processes hundreds of thousands of operations (matrix operations) in a single clock cycle.
 To implement a large-scale matrix processor, Google uses a different architecture from CPUs and GPUs, called a systolic array (see the simulation sketch below).
 The MXU reads each input value once and reuses it for many different operations without storing it back to a register – unlike a CPU.
 CPUs and GPUs often spend energy accessing multiple registers per operation; a systolic array chains multiple ALUs together, reusing the result of reading a single register.
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Fig 9 – Register Access: CPU and GPU vs. TPU
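To show the systolic dataflow concretely, here is a minimal cycle-level simulation in Python of a systolic matrix multiply: operands enter skewed at the array edges, every processing element (PE) performs one multiply-add per cycle, and inputs are passed to neighbors instead of being written back to registers. This is an output-stationary sketch for simplicity (the real MXU keeps weights stationary), and all sizes are illustrative.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Output-stationary systolic matrix multiply: A flows right, B flows
    down, and each PE does one multiply-add per cycle, passing inputs on."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))                 # each PE accumulates one C[i, j]
    a_reg = np.zeros((n, m))             # value on each PE's horizontal wire
    b_reg = np.zeros((n, m))             # value on each PE's vertical wire
    for t in range(n + m + k - 2):       # cycles until the last operand drains
        new_a, new_b = np.zeros_like(a_reg), np.zeros_like(b_reg)
        new_a[:, 1:] = a_reg[:, :-1]     # pass A values one PE to the right
        new_b[1:, :] = b_reg[:-1, :]     # pass B values one PE down
        for i in range(n):               # skewed injection at the left edge
            if 0 <= t - i < k:
                new_a[i, 0] = A[i, t - i]
        for j in range(m):               # skewed injection at the top edge
            if 0 <= t - j < k:
                new_b[0, j] = B[t - j, j]
        a_reg, b_reg = new_a, new_b
        C += a_reg * b_reg               # every PE: one multiply-add per cycle
    return C

rng = np.random.default_rng(0)
A = rng.integers(0, 5, (3, 4)).astype(float)
B = rng.integers(0, 5, (4, 2)).astype(float)
assert np.allclose(systolic_matmul(A, B), A @ B)
print("systolic result matches A @ B")
```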
Parallel Processing on the Matrix Multiply Unit (MXU)
 A systolic array is a homogeneous network of tightly coupled Data Processing Units (DPUs) called nodes.
 It uses a Multiple Instruction, Single Data (MISD) architecture.
 The design is called systolic because the data flows through a network of hard-wired processor nodes.
 The systolic array contains 256×256 = 65,536 ALUs, which means the TPU can process 65,536 multiply-and-adds on 8-bit integers every cycle.
 The clock frequency of the TPU is 700 MHz, so the TPU can compute 65,536 × 700 MHz ≈ 46 × 10¹² multiply-and-add operations per second (see the arithmetic check below).
 Number of operations per cycle for CPU, GPU, and TPU:

| Processor | Operations per cycle |
|---|---|
| CPU | a few |
| CPU (vector extension) | tens |
| GPU | tens of thousands |
| TPU | hundreds of thousands (up to 128K) |
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
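The throughput claim above is simple arithmetic; a quick check in Python:

```python
# Peak multiply-and-add throughput of the TPU at 700 MHz.
alus = 256 * 256      # 65,536 8-bit MAC units in the systolic array
clock_hz = 700e6      # 700 MHz
print(f"{alus} ALUs x 700 MHz = {alus * clock_hz:.2e} MACs/s")  # ~4.6e13 ≈ 46 × 10^12
```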
Roofline Model – CPU vs. GPU vs. TPU
 The roofline model ties floating-point performance, memory performance, and arithmetic intensity together in a 2-D graph.
 Arithmetic intensity (AI)
 The ratio of floating-point operations per byte of memory accessed.
 The X-axis is arithmetic intensity and the Y-axis is performance in floating-point operations per second.
 The graph can be plotted using:
 Attainable GFLOPs/sec = min(Peak memory BW × AI, Peak floating-point perf.)
 The comparison is made between an Intel Haswell (CPU), an Nvidia Tesla K80 (GPU), and TPUv1 for six different neural-network applications (see the sketch below).
 The six NN applications sit further below the ceiling on the CPU and GPU than on the TPU – the TPU achieves higher performance.
Fig 10 – Roofline Model of CPU
Fig 11 – Roofline Model of GPU
Fig 12 – Roofline Model of TPU
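A minimal sketch of the roofline formula in Python; the peak-bandwidth and peak-throughput figures approximate those reported in the TPU paper for these three parts, and the arithmetic-intensity values are arbitrary sample points:

```python
# Roofline: attainable performance = min(peak memory BW x AI, peak compute).
# Peak figures approximate the TPU paper's numbers; AI points are arbitrary.
def attainable_gops(ai: float, peak_bw_gbs: float, peak_gops: float) -> float:
    return min(peak_bw_gbs * ai, peak_gops)

platforms = {
    "Haswell CPU":  (51.0, 1_300.0),    # (GB/s, GOPS)
    "Tesla K80":    (160.0, 2_800.0),
    "TPUv1 (int8)": (34.0, 92_000.0),
}

for ai in (2, 16, 128):                 # arithmetic intensity: ops per byte
    for name, (bw, peak) in platforms.items():
        gops = attainable_gops(ai, bw, peak)
        print(f"AI={ai:>3}  {name:<13} {gops:>8.0f} GOPS")
# Low-AI points sit on the slanted memory roof (BW x AI); high-AI points hit
# the flat compute roof -- the TPU's roof is far higher for these workloads.
```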
Energy Efficiency of a Google WSC
 The workload of a WSC grows tremendously, consuming a great deal of power and energy.
 The simple metric used to evaluate the efficiency of a WSC is power usage effectiveness, or PUE (a worked example follows the loss table):
 PUE = (Total facility power) / (IT equipment power)
 Power usage effectiveness is the ratio between the total energy entering a WSC and the energy used by the IT equipment inside it.
 The constant stream of requests keeps the WSC busy all the time, contributing to other kinds of energy losses such as power distribution loss, cooling loss, and air loss.
Energy losses in a Google WSC:

| Loss category | Share |
|---|---|
| Server losses | 34% |
| Rectifier losses | 13% |
| Power distribution loss | 14% |
| Cooling | 18% |
| Other losses | 14% |
| Air loss | 7% |
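A tiny worked example of the PUE formula, using made-up facility numbers:

```python
# PUE = total facility power / IT equipment power (made-up numbers).
it_equipment_kw = 10_000      # servers, storage, and networking gear
overhead_kw = 1_200           # cooling, power distribution, lighting, ...
total_facility_kw = it_equipment_kw + overhead_kw

pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")     # 1.12: every IT watt costs 0.12 W of overhead
```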
Energy Efficiency of a Google WSC
 Some of the workloads considered to have the highest power consumption:
 Web search: high request throughput and data-processing requests
 Webmail: a disk-I/O-intensive internet service, where each machine is configured with a large number of disk drives to run this workload
 MapReduce: cluster processing that uses hundreds or thousands of servers in large offline jobs to process terabytes of data
 To reduce power consumption, Google implements
 CPU voltage/frequency scaling (DVFS; sketched below)
 DVFS reduces the servers’ power consumption by dynamically changing the voltage and frequency of a CPU according to its load.
 Google reduces power by 23% using this technique.
Fig 13 – Energy Consumption Comparison
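DVFS saves power because dynamic CPU power scales roughly as C·V²·f, so lowering voltage and frequency together pays off quadratically in the voltage term. A sketch of that relationship with purely illustrative values (not measurements of Google's servers):

```python
# Dynamic CPU power ~ C * V^2 * f; values below are illustrative only.
def dynamic_power_w(cap_farads: float, volts: float, freq_hz: float) -> float:
    return cap_farads * volts ** 2 * freq_hz

C_EFF = 1e-9  # effective switched capacitance (assumed)
full   = dynamic_power_w(C_EFF, volts=1.20, freq_hz=3.0e9)  # busy server
scaled = dynamic_power_w(C_EFF, volts=1.00, freq_hz=2.2e9)  # DVFS, light load

print(f"full: {full:.2f} W, scaled: {scaled:.2f} W, "
      f"saving {100 * (1 - scaled / full):.0f}%")
# Lowering V and f together saves far more than frequency alone, because
# voltage enters the equation quadratically.
```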
Performance of a Google WSC
 The overall WSC performance can be calculated by aggregating per-job performance (see the sketch below):
 WSC performance = ∑ᵢ (weightᵢ × performance metricᵢ), where i denotes a unique job ID
 Weight – determines how much a job's performance affects the overall performance
 Performance metric – Google WSC uses IPC (instructions per cycle) to evaluate the performance of a job
 Reasons for the performance impact:
 A CPU that suffers from memory latency and limited memory bandwidth incurs stall cycles due to cache misses, degrading processor performance.
 The lower performance comes from data-cache and instruction-cache misses; these two kinds of misses lower IPC.
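A minimal sketch of the aggregation formula, with hypothetical jobs, weights, and IPC values:

```python
# WSC performance = sum over jobs of weight_i * metric_i (IPC as the metric).
# Job names, weights, and IPC values are hypothetical.
jobs = [
    {"id": "websearch", "weight": 0.5, "ipc": 1.2},
    {"id": "gmail",     "weight": 0.3, "ipc": 0.9},
    {"id": "mapreduce", "weight": 0.2, "ipc": 1.6},
]

assert abs(sum(j["weight"] for j in jobs) - 1.0) < 1e-9  # weights sum to 1
wsc_performance = sum(j["weight"] * j["ipc"] for j in jobs)
print(f"aggregate WSC performance: {wsc_performance:.2f} IPC")  # 1.19
```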
Top-Down Micro-Architectural Analysis Method
 The Top-Down Micro-architecture Analysis Method (TDMAM) is used to identify the performance issues in a processor.
 Google previously used a naïve (traditional) approach to identify performance bottlenecks, and adopted TDMAM in 2015.
 Simple, structured, quick
 The pipeline of a CPU used in a WSC is quite complex, and it is divided into two halves:
 Front end: responsible for fetching the program code, which is decoded into two or more low-level hardware operations called micro-ops (uops)
 Back end: the micro-ops are passed to a process called allocation; once allocated, the back end checks for an available execution unit and tries to execute the micro-ops.
 Pipeline slots are classified into four broad categories (see the classification sketch below):
 Retiring – a micro-op leaves the queue and commits
 Bad speculation – a pipeline slot is wasted due to incorrect speculation
 Front-end bound – overheads due to fetching, instruction caches, and decoding
 Back-end bound – overheads due to the data-cache hierarchy and the lack of instruction-level parallelism
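A sketch of the level-1 top-down classification, using the canonical formulas from Yasin's TMAM paper for a 4-issue-wide core; the counter values below are invented, whereas on real hardware they would come from CPU performance counters:

```python
# Level-1 top-down breakdown (formulas from Yasin's TMAM paper, 4-wide core).
# Counter values are invented; real ones come from CPU performance counters.
ISSUE_WIDTH = 4

def top_down_level1(clks, uops_issued, uops_retired, recovery_cycles,
                    fe_undelivered_uops):
    slots = ISSUE_WIDTH * clks                      # total pipeline slots
    retiring = uops_retired / slots                 # useful, committed work
    bad_spec = (uops_issued - uops_retired
                + ISSUE_WIDTH * recovery_cycles) / slots
    fe_bound = fe_undelivered_uops / slots          # fetch/decode starvation
    be_bound = 1.0 - (retiring + bad_spec + fe_bound)
    return {"retiring": retiring, "bad_speculation": bad_spec,
            "frontend_bound": fe_bound, "backend_bound": be_bound}

breakdown = top_down_level1(clks=1_000_000, uops_issued=1_400_000,
                            uops_retired=1_300_000, recovery_cycles=20_000,
                            fe_undelivered_uops=600_000)
for category, share in breakdown.items():
    print(f"{category:>16}: {share:6.1%}")
# A dominant backend_bound share matches Google's finding: data-cache stalls
# and limited ILP are the main bottleneck in WSC workloads.
```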
Top-Down Micro-Architectural Analysis Method
 The chart represents the pipeline-slot breakdown of applications running in a Google WSC.
 There is a large number of stalled cycles in the back end due to the lack of instruction-level parallelism.
 The processor finds it difficult to run all the instructions simultaneously, which increases the memory stall time.
 To overcome this, Google uses Simultaneous Multithreading (SMT) to hide the latency by overlapping the stall cycles.
 SMT is an architectural feature that allows instructions from more than one thread to execute in any given pipeline stage at a time.
 SMT increases CPU performance by supporting thread-level parallelism.
Cooling System
 The purpose of a WSC cooling system is to remove the heat generated by the equipment.
 Google WSC uses ceiling-mounted cooling.
 This type of cooling system has a large plenum space that removes the hot air from the data center; once the plenum space removes the heat, a fan coil unit blows the cold air back toward the intake of the data center:
 (1) Hot exhaust from the data center rises in a vertical plenum space.
 (2) The hot air enters a large plenum space above the drop ceiling.
 (3) Heat is exchanged with process water in a fan coil unit.
 (4) The fan coil unit blows the cold air down toward the intake of the data center.
Fig 14 – Google’s Cooling System
Conclusion
 Computation in a WSC does not rely on a single machine; it requires hundreds or thousands of machines connected over a network to achieve greater performance. We also observed that Google WSC deploys hardware accelerators such as GPUs and TPUs to increase performance and energy efficiency.
 Designing a performance-oriented and energy-efficient WSC is a main concern, so Google has implemented power-saving approaches and performance-improvement mechanisms such as SMT to hide the stall cycles caused by cache misses.
 Hence, Google uses the hardware and techniques described above to design performant and energy-efficient warehouse-scale systems.
Thank You