Google Warehouse-Scale Computer: Hardware and Performance
Prepared by: Tejhaskar Ashok Kumar
Master of Applied Science in Computer
Engineering
Memorial University of Newfoundland
Outline
 Introduction
 Architectural Overview of Google WSC
 Server
 Storage
 Network
 Hardware Accelerators
 GPU
 TPU
 Energy and Power Efficiency
 Performance of Google WSC
 Top-Down Micro-architectural Analysis Method
 Cooling Systems
Introduction
 A Warehouse-Scale Computer (WSC) comprises tens to thousands of clustered servers connected by a network to process thousands to millions of user requests.
 A WSC can be used to provide internet services such as search, video sharing, and e-commerce.
 The difference between a traditional data center and a WSC:
 Data centers host services for multiple providers
 A WSC is run by a single organization
 WSCs exploit Request-Level Parallelism and Data-Level Parallelism
 WSC Design Goals:
 Cost-Performance
 Energy Efficiency
 Dependability via Redundancy
 Network I/O
 Interactive and Batch processing workloads
Architectural Overview of Google WSC
 In general, a WSC is a building that can be viewed as a single computer, in which multiple server nodes and storage are tightly coupled by interconnection networks.
 Inside the building there are many containers, and each container holds multiple servers arranged in racks and interconnected with storage.
 Each container also has cooling support to remove the heat generated.
 A Google WSC uses more than 100,000 servers.
 WSCs are designed to serve every request for internet services reliably and quickly.
Fig 1 – WSC as a building
Servers
 The WSC uses low-end servers in a 1U or blade-enclosure format; the servers are mounted in racks, and each server is connected to an Ethernet switch.
 Google WSC uses 19-inch-wide racks, each holding 48 blade servers connected to a rack switch.
 Each server has a PCIe link, which is used to connect the CPU servers to the GPU and TPU trays.
 Every server is arranged in a rack that is 7 ft high, 4 ft wide, and 2 ft deep, and contains about 48 slots for loading servers, power conversion units, network switches, and a battery backup tray that is used during power interruptions.
 CPUs used in Google WSC: Intel Xeon Scalable (Cascade Lake), Intel Xeon Scalable (Skylake), Intel Xeon E7 (Broadwell E7), Intel Xeon E5 v4 (Broadwell E5), Intel Xeon E5 v3 (Haswell), Intel Xeon E5 v2 (Ivy Bridge), Intel Xeon E5 (Sandy Bridge), AMD EPYC Rome
Fig 2 – Server to Cluster
Fig 3 – Server Rack
Storage
 Google WSC relies on local disks and uses the Google File System (GFS), a distributed file system developed by Google.
 A GFS cluster contains multiple nodes, divided into two categories (see the lookup sketch after this slide):
 Master node: maintains the file-system metadata used to locate the data for the current request
 Chunkserver: stores the data chunks and maintains the copies of the data
 Google WSC storage keeps at least three replicas to improve dependability.
 Every server's storage is interconnected with the other server storage in its local rack, and the storage of every local rack is interconnected at the cluster level.
Fig 4 – Storage Hierarchy of Google WSC
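To make the master/chunkserver split concrete, here is a minimal Python sketch of a GFS-style lookup. The master holds only metadata (chunk handles and replica locations) while chunkservers hold the bytes; all class names, paths, and server names are hypothetical, not Google's actual API.

```python
# Minimal GFS-style lookup sketch (illustrative names, not Google's API).
REPLICATION_FACTOR = 3  # GFS keeps at least three replicas of each chunk

class Master:
    """Holds metadata only: file -> chunk handles -> replica locations."""
    def __init__(self):
        self.file_chunks = {"/logs/web-00": ["chunk-17", "chunk-18"]}
        self.chunk_replicas = {
            "chunk-17": ["rack1-srv04", "rack2-srv11", "rack5-srv23"],
            "chunk-18": ["rack1-srv09", "rack3-srv02", "rack4-srv17"],
        }

    def lookup(self, path: str) -> list[tuple[str, list[str]]]:
        # Return (chunk handle, replica servers); data never flows through the master.
        return [(h, self.chunk_replicas[h]) for h in self.file_chunks[path]]

class Chunkserver:
    """Holds the actual chunk bytes; clients read from the nearest replica."""
    def __init__(self, name: str):
        self.name = name
        self.chunks: dict[str, bytes] = {}

master = Master()
for handle, replicas in master.lookup("/logs/web-00"):
    assert len(replicas) >= REPLICATION_FACTOR
    print(f"{handle}: read from nearest of {replicas}")
```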
Network
 The Google WSC network is a Clos network, a multistage network built from low-port-count switches.
 Clos networks are fault-tolerant and provide excellent bandwidth; Google improves the bandwidth of the network by adding more stages to the multistage network.
 Google uses a 3-stage Clos network (sketched below):
 Ingress stage (input stage)
 Middle stage
 Egress stage (output stage)
 Google uses the Jupiter Clos network inside its WSCs.
Fig 5 – 3-stage Clos network
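To illustrate why adding middle-stage switches adds bandwidth and fault tolerance, here is a small Python sketch that wires up a 3-stage Clos topology and counts the disjoint paths between an ingress and an egress switch; the switch counts are invented for illustration.

```python
# Hypothetical 3-stage Clos sketch: every ingress switch connects to every
# middle-stage switch, and every middle-stage switch to every egress switch.
from itertools import product

def build_clos(num_ingress: int, num_middle: int, num_egress: int) -> set:
    links = set()
    for i, m in product(range(num_ingress), range(num_middle)):
        links.add((f"in{i}", f"mid{m}"))   # ingress -> middle stage
    for m, e in product(range(num_middle), range(num_egress)):
        links.add((f"mid{m}", f"out{e}"))  # middle -> egress stage
    return links

NUM_MIDDLE = 3
links = build_clos(num_ingress=4, num_middle=NUM_MIDDLE, num_egress=4)

# Every middle switch reachable from in0 and reaching out2 is a disjoint path.
paths = [m for m in range(NUM_MIDDLE)
         if ("in0", f"mid{m}") in links and (f"mid{m}", "out2") in links]
print(f"{len(paths)} disjoint paths from in0 to out2")  # one per middle switch
# Adding middle-stage switches adds disjoint paths, which is where the extra
# bandwidth and fault tolerance come from.
```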
Hardware Accelerators
 If the overall performance of a uniprocessor is too slow, additional hardware can be used to speed up the system; this hardware is called a hardware accelerator.
 A hardware accelerator is a component that works alongside the processor and executes its tasks much faster than the processor can.
 An accelerator appears as a device on the bus.
 Google WSC uses:
 Graphical Processing Unit (GPU)
 Tensor Processing Unit (TPU)
Fig 6 – Hardware Accelerators
Graphical Processing Unit (GPU)
 In a Google WSC, each server CPU is connected to a PCIe-attached accelerator tray containing multiple GPUs.
 The GPUs within a tray are interconnected with NVLink, a wire-based, short-range communication protocol developed by Nvidia.
 Each streaming multiprocessor (SM) has an L1 cache associated with its cores, and the SMs share an L2 cache.
 The presence of many CUDA cores makes computation on an Nvidia GPU faster than on a CPU.
 In a GPU, the task to be executed is divided into several processes and distributed across several Processor Clusters (PCs) to achieve low memory latency and high throughput.
 The GPU has smaller cache layers than a CPU, since the GPU dedicates more of its transistors to computation.
 And since there are many cores, parallelism is achieved by running the processes effectively, quickly, and reliably.
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4
Fig 7 – Graphical Processing Unit
Graphical Processing Unit (GPU)
 For compute workloads, Google WSC uses the following GPUs, designed specifically for AI and data-center solutions (a peak-TFLOPS estimate appears below):
 NVIDIA® Tesla® T4
 NVIDIA® Tesla® V100
 NVIDIA® Tesla® P100
 NVIDIA® Tesla® P4
 NVIDIA® Tesla® K80
| GPU | No. of Cores | Memory Size | Memory Type | SM Count | Tensor Cores | L1 Cache | L2 Cache | FP32 (float) Performance |
|---|---|---|---|---|---|---|---|---|
| Tesla T4 | 2560 | 16 GB | GDDR6 | 40 | 320 | 64 KB/SM | 4 MB | 8.141 TFLOPS |
| Tesla V100 | 5120 | 32 GB | HBM2 | 80 | 640 | 128 KB/SM | 6 MB | 14.13 TFLOPS |
| Tesla P100 | 3584 | 16 GB | HBM2 | 56 | NA | 24 KB/SM | 4 MB | 9.526 TFLOPS |
| Tesla P4 | 2560 | 8 GB | GDDR5 | 20 | NA | 48 KB/SM | 2 MB | 5.704 TFLOPS |
| Tesla K80 | 4992 | 24 GB | GDDR5 | 26 | NA | 32 KB/SM | 1536 KB | 8.226 TFLOPS |
Chart – FP32 performance (TFLOPS) of the five GPUs
https://cloud.google.com/compute/docs/gpus
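As a sanity check on the FP32 column, peak FP32 throughput can be estimated as cores × boost clock × 2 FLOPs (one fused multiply-add per CUDA core per cycle). A minimal sketch, where the boost clocks are assumptions taken from public spec sheets:

```python
# Rough peak FP32 estimate: cores x boost clock (GHz) x 2 FLOPs (FMA) per cycle.
# Boost clocks are assumptions from public spec sheets, not measured values.
gpus = {
    "Tesla T4":   (2560, 1.590),
    "Tesla V100": (5120, 1.380),
    "Tesla P100": (3584, 1.329),
}

for name, (cores, boost_ghz) in gpus.items():
    tflops = cores * boost_ghz * 2 / 1000.0
    print(f"{name}: ~{tflops:.2f} TFLOPS FP32")
# ~8.14, ~14.13, and ~9.53 TFLOPS -- in line with the table above.
```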
Tensor Processing Unit (TPU)
 Google’s ASIC, designed specifically for AI solutions
 Matrix Multiply Unit (MXU) – the heart of the TPU
 Contains a 256×256 MAC array
 The Weight FIFO uses 8 GB of off-chip DRAM to provide weights to the MXU
 The Unified Buffer (24 MB) keeps the activation inputs/outputs of the MXU and the host
 Accumulators (4 MiB) collect the 16-bit products of the MXU
Fig 8 – Inside the TPU
Parallel Processing on the Matrix Multiply Unit (MXU)
 A typical RISC processor processes a single operation (scalar processing) with each instruction.
 A GPU uses vector processing and performs operations concurrently on multiple SMs; a GPU performs hundreds to thousands of operations in a single clock cycle.
 To increase the number of operations per clock cycle further, Google developed a matrix processor that processes hundreds of thousands of operations (matrix operations) in a single clock cycle.
 To implement a large-scale matrix processor, Google uses a different architecture from CPUs and GPUs, called a systolic array (see the simulation sketch below).
 The MXU reads each input value once and reuses it for many different operations without storing it back to a register – unlike a CPU.
 CPUs and GPUs often spend energy accessing multiple registers per operation; a systolic array chains multiple ALUs together, reusing the result of reading a single register.
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Fig 9 – Register Access: CPU and GPU vs. TPU
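To show the systolic dataflow concretely, here is a minimal cycle-level simulation in Python of a systolic matrix multiply: operands enter skewed at the array edges, every processing element (PE) performs one multiply-add per cycle, and inputs are passed to neighbors instead of being written back to registers. This is an output-stationary sketch for simplicity (the real MXU keeps weights stationary), and all sizes are illustrative.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Output-stationary systolic matrix multiply: A flows right, B flows
    down, and each PE does one multiply-add per cycle, passing inputs on."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))                 # each PE accumulates one C[i, j]
    a_reg = np.zeros((n, m))             # value on each PE's horizontal wire
    b_reg = np.zeros((n, m))             # value on each PE's vertical wire
    for t in range(n + m + k - 2):       # cycles until the last operand drains
        new_a, new_b = np.zeros_like(a_reg), np.zeros_like(b_reg)
        new_a[:, 1:] = a_reg[:, :-1]     # pass A values one PE to the right
        new_b[1:, :] = b_reg[:-1, :]     # pass B values one PE down
        for i in range(n):               # skewed injection at the left edge
            if 0 <= t - i < k:
                new_a[i, 0] = A[i, t - i]
        for j in range(m):               # skewed injection at the top edge
            if 0 <= t - j < k:
                new_b[0, j] = B[t - j, j]
        a_reg, b_reg = new_a, new_b
        C += a_reg * b_reg               # every PE: one multiply-add per cycle
    return C

rng = np.random.default_rng(0)
A = rng.integers(0, 5, (3, 4)).astype(float)
B = rng.integers(0, 5, (4, 2)).astype(float)
assert np.allclose(systolic_matmul(A, B), A @ B)
print("systolic result matches A @ B")
```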
Parallel Processing on the Matrix Multiply Unit (MXU)
 A systolic array is a homogeneous network of tightly coupled Data Processing Units (DPUs) called nodes.
 It uses a Multiple Instruction, Single Data (MISD) architecture.
 The design is called systolic because the data flows through a network of hard-wired processor nodes.
 The systolic array contains 256×256 = 65,536 ALUs, which means the TPU can process 65,536 multiply-and-adds on 8-bit integers every cycle.
 The clock frequency of the TPU is 700 MHz, so the TPU can compute 65,536 × 700 MHz ≈ 46 × 10¹² multiply-and-add operations per second (see the arithmetic check below).
 Number of operations per cycle for CPU, GPU, and TPU:

| Processor | Operations per cycle |
|---|---|
| CPU | a few |
| CPU (vector extension) | tens |
| GPU | tens of thousands |
| TPU | hundreds of thousands (up to 128K) |
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
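The throughput claim above is simple arithmetic; a quick check in Python:

```python
# Peak multiply-and-add throughput of the TPU at 700 MHz.
alus = 256 * 256      # 65,536 8-bit MAC units in the systolic array
clock_hz = 700e6      # 700 MHz
print(f"{alus} ALUs x 700 MHz = {alus * clock_hz:.2e} MACs/s")  # ~4.6e13 ≈ 46 × 10^12
```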
Roofline Model – CPU vs. GPU vs. TPU
 The roofline model ties floating-point performance, memory performance, and arithmetic intensity together in a 2-D graph.
 Arithmetic intensity (AI)
 The ratio of floating-point operations per byte of memory accessed.
 The X-axis is arithmetic intensity and the Y-axis is performance in floating-point operations per second.
 The graph can be plotted using:
 Attainable GFLOPs/sec = min(Peak memory BW × AI, Peak floating-point perf.)
 The comparison is made between an Intel Haswell (CPU), an Nvidia Tesla K80 (GPU), and TPUv1 for six different neural-network applications (see the sketch below).
 The six NN applications sit further below the ceiling on the CPU and GPU than on the TPU – the TPU achieves higher performance.
Fig 10 – Roofline Model of CPU
Fig 11 – Roofline Model of GPU
Fig 12 – Roofline Model of TPU
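A minimal sketch of the roofline formula in Python; the peak-bandwidth and peak-throughput figures approximate those reported in the TPU paper for these three parts, and the arithmetic-intensity values are arbitrary sample points:

```python
# Roofline: attainable performance = min(peak memory BW x AI, peak compute).
# Peak figures approximate the TPU paper's numbers; AI points are arbitrary.
def attainable_gops(ai: float, peak_bw_gbs: float, peak_gops: float) -> float:
    return min(peak_bw_gbs * ai, peak_gops)

platforms = {
    "Haswell CPU":  (51.0, 1_300.0),    # (GB/s, GOPS)
    "Tesla K80":    (160.0, 2_800.0),
    "TPUv1 (int8)": (34.0, 92_000.0),
}

for ai in (2, 16, 128):                 # arithmetic intensity: ops per byte
    for name, (bw, peak) in platforms.items():
        gops = attainable_gops(ai, bw, peak)
        print(f"AI={ai:>3}  {name:<13} {gops:>8.0f} GOPS")
# Low-AI points sit on the slanted memory roof (BW x AI); high-AI points hit
# the flat compute roof -- the TPU's roof is far higher for these workloads.
```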
Energy Efficiency of a Google WSC
 The workload of a WSC grows tremendously, consuming a great deal of power and energy.
 The simple metric used to evaluate the efficiency of a WSC is power usage effectiveness, or PUE (a worked example follows the loss table):
 PUE = (Total facility power) / (IT equipment power)
 Power usage effectiveness is the ratio between the total energy entering a WSC and the energy used by the IT equipment inside it.
 The constant stream of requests keeps the WSC busy all the time, contributing to other kinds of energy losses such as power distribution loss, cooling loss, and air loss.
Energy losses in a Google WSC:

| Loss category | Share |
|---|---|
| Server losses | 34% |
| Rectifier losses | 13% |
| Power distribution loss | 14% |
| Cooling | 18% |
| Other losses | 14% |
| Air loss | 7% |
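A tiny worked example of the PUE formula, using made-up facility numbers:

```python
# PUE = total facility power / IT equipment power (made-up numbers).
it_equipment_kw = 10_000      # servers, storage, and networking gear
overhead_kw = 1_200           # cooling, power distribution, lighting, ...
total_facility_kw = it_equipment_kw + overhead_kw

pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")     # 1.12: every IT watt costs 0.12 W of overhead
```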
Energy Efficiency of a Google WSC
 Some of the workloads considered to have the highest power consumption:
 Web search: high request throughput and data-processing requests
 Webmail: a disk-I/O-intensive internet service, where each machine is configured with a large number of disk drives to run this workload
 MapReduce: cluster processing that uses hundreds or thousands of servers in large offline jobs to process terabytes of data
 To reduce power consumption, Google implements
 CPU voltage/frequency scaling (DVFS; sketched below)
 DVFS reduces the servers’ power consumption by dynamically changing the voltage and frequency of a CPU according to its load.
 Google reduces power by 23% using this technique.
Fig 13 – Energy Consumption Comparison
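DVFS saves power because dynamic CPU power scales roughly as C·V²·f, so lowering voltage and frequency together pays off quadratically in the voltage term. A sketch of that relationship with purely illustrative values (not measurements of Google's servers):

```python
# Dynamic CPU power ~ C * V^2 * f; values below are illustrative only.
def dynamic_power_w(cap_farads: float, volts: float, freq_hz: float) -> float:
    return cap_farads * volts ** 2 * freq_hz

C_EFF = 1e-9  # effective switched capacitance (assumed)
full   = dynamic_power_w(C_EFF, volts=1.20, freq_hz=3.0e9)  # busy server
scaled = dynamic_power_w(C_EFF, volts=1.00, freq_hz=2.2e9)  # DVFS, light load

print(f"full: {full:.2f} W, scaled: {scaled:.2f} W, "
      f"saving {100 * (1 - scaled / full):.0f}%")
# Lowering V and f together saves far more than frequency alone, because
# voltage enters the equation quadratically.
```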
Performance of a Google WSC
 The overall WSC performance can be calculated by aggregating per-job performance (see the sketch below):
 WSC performance = ∑ᵢ (weightᵢ × performance metricᵢ), where i denotes a unique job ID
 Weight – determines how much a job's performance affects the overall performance
 Performance metric – Google WSC uses IPC (instructions per cycle) to evaluate the performance of a job
 Reasons for the performance impact:
 A CPU that suffers from memory latency and limited memory bandwidth incurs stall cycles due to cache misses, degrading processor performance.
 The lower performance comes from data-cache and instruction-cache misses; these two kinds of misses lower IPC.
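A minimal sketch of the aggregation formula, with hypothetical jobs, weights, and IPC values:

```python
# WSC performance = sum over jobs of weight_i * metric_i (IPC as the metric).
# Job names, weights, and IPC values are hypothetical.
jobs = [
    {"id": "websearch", "weight": 0.5, "ipc": 1.2},
    {"id": "gmail",     "weight": 0.3, "ipc": 0.9},
    {"id": "mapreduce", "weight": 0.2, "ipc": 1.6},
]

assert abs(sum(j["weight"] for j in jobs) - 1.0) < 1e-9  # weights sum to 1
wsc_performance = sum(j["weight"] * j["ipc"] for j in jobs)
print(f"aggregate WSC performance: {wsc_performance:.2f} IPC")  # 1.19
```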
Top-Down Micro-Architectural Analysis Method
 The Top-Down Micro-architecture Analysis Method (TDMAM) is used to identify the performance issues in a processor.
 Google previously used a naïve (traditional) approach to identify performance bottlenecks, and adopted TDMAM in 2015.
 Simple, structured, quick
 The pipeline of a CPU used in a WSC is quite complex, and it is divided into two halves:
 Front end: responsible for fetching the program code, which is decoded into two or more low-level hardware operations called micro-ops (uops)
 Back end: the micro-ops are passed to a process called allocation; once allocated, the back end checks for an available execution unit and tries to execute the micro-ops.
 Pipeline slots are classified into four broad categories (see the classification sketch below):
 Retiring – a micro-op leaves the queue and commits
 Bad speculation – a pipeline slot is wasted due to incorrect speculation
 Front-end bound – overheads due to fetching, instruction caches, and decoding
 Back-end bound – overheads due to the data-cache hierarchy and the lack of instruction-level parallelism
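A sketch of the level-1 top-down classification, using the canonical formulas from Yasin's TMAM paper for a 4-issue-wide core; the counter values below are invented, whereas on real hardware they would come from CPU performance counters:

```python
# Level-1 top-down breakdown (formulas from Yasin's TMAM paper, 4-wide core).
# Counter values are invented; real ones come from CPU performance counters.
ISSUE_WIDTH = 4

def top_down_level1(clks, uops_issued, uops_retired, recovery_cycles,
                    fe_undelivered_uops):
    slots = ISSUE_WIDTH * clks                      # total pipeline slots
    retiring = uops_retired / slots                 # useful, committed work
    bad_spec = (uops_issued - uops_retired
                + ISSUE_WIDTH * recovery_cycles) / slots
    fe_bound = fe_undelivered_uops / slots          # fetch/decode starvation
    be_bound = 1.0 - (retiring + bad_spec + fe_bound)
    return {"retiring": retiring, "bad_speculation": bad_spec,
            "frontend_bound": fe_bound, "backend_bound": be_bound}

breakdown = top_down_level1(clks=1_000_000, uops_issued=1_400_000,
                            uops_retired=1_300_000, recovery_cycles=20_000,
                            fe_undelivered_uops=600_000)
for category, share in breakdown.items():
    print(f"{category:>16}: {share:6.1%}")
# A dominant backend_bound share matches Google's finding: data-cache stalls
# and limited ILP are the main bottleneck in WSC workloads.
```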
Top-Down Micro-Architectural Analysis Method
 The chart represents the pipeline-slot breakdown of applications running in a Google WSC.
 There is a large number of stalled cycles in the back end due to the lack of instruction-level parallelism.
 The processor finds it difficult to run all the instructions simultaneously, which increases the memory stall time.
 To overcome this, Google uses Simultaneous Multithreading (SMT) to hide the latency by overlapping the stall cycles.
 SMT is an architectural feature that allows instructions from more than one thread to execute in any given pipeline stage at a time.
 SMT increases CPU performance by supporting thread-level parallelism.
Cooling System
 The purpose of a WSC cooling system is to remove the heat generated by the equipment.
 Google WSC uses ceiling-mounted cooling.
 This type of cooling system has a large plenum space that removes the hot air from the data center; once the plenum space removes the heat, a fan coil unit blows the cold air back toward the intake of the data center:
 (1) Hot exhaust from the data center rises in a vertical plenum space.
 (2) The hot air enters a large plenum space above the drop ceiling.
 (3) Heat is exchanged with process water in a fan coil unit.
 (4) The fan coil unit blows the cold air down toward the intake of the data center.
Fig 14 – Google’s Cooling System
Conclusion
 Computation in a WSC does not rely on a single machine; it requires hundreds or thousands of machines connected over a network to achieve greater performance. We also observed that Google WSC deploys hardware accelerators such as GPUs and TPUs to increase performance and energy efficiency.
 Designing a performance-oriented and energy-efficient WSC is a main concern, so Google has implemented power-saving approaches and performance-improvement mechanisms such as SMT to hide the stall cycles caused by cache misses.
 Hence, Google uses the hardware and techniques described above to design performant and energy-efficient warehouse-scale systems.
Thank You