SUMMIT SUPERCOMPUTER
Supervisor: Dr. R. Venkatesan
Presentation by: Vigneshwar Ramaswamy
MASc in Computer Engineering
MUN ID: 201990029
Memorial University of Newfoundland, Canada
Summit Supercomputer Architecture 1
Outline
• Introduction
• Summit Overview
• Specification of Summit
• IBM Power9 Architecture
• NVIDIA Tesla V100 Architecture
• Interconnect
• Application
Introduction
• Summit was the fastest computer in the world on the TOP500 list from June 2018 until June 2020.
• Ranked 2nd on the TOP500 with a High Performance Linpack (HPL) result of 148.6 PFLOPS.
• Ranked 8th on the Green500 with a power efficiency of 14.719 GFLOPS/watt.
• From June 2018 to 2020 Summit also topped the HPCG benchmark, and it was used by 5 of the 6 Gordon Bell Prize finalist teams.
• Summit has reached exa operations per second (exaops), achieving 1.88 exaops during a genomic analysis, and it is expected to reach 3.3 exaops using mixed-precision calculations.
Summit Overview and Specifications
• Processors: IBM POWER9™ (2 per node)
• GPUs: 27,648 NVIDIA Volta V100s (6 per node)
• Theoretical peak performance (Rpeak): 200 PFLOPS
• Linpack performance (Rmax): 148.6 PFLOPS
• Cores: 2,414,592
• Storage capacity: 250 PB
• Nodes: 4,608
• Memory per node: 512 GB DDR4 + 96 GB HBM2 (1/2TF, accessible by both CPUs and GPUs)
• Non-volatile memory per node: 1,600 GB
• Total system memory: >10 PB (DDR4 + HBM2 + non-volatile)
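The system-wide totals follow from the per-node figures. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 10^6 GB), 22 active cores per POWER9 (the figure given in the comparison table) and 80 SMs per V100 (640 Tensor Cores at 8 per SM). The variable names are illustrative only, and counting each GPU SM as one core is an assumption that happens to reproduce the listed total:

```python
# Per-node figures taken from the specification list above.
NODES = 4608
CPUS_PER_NODE = 2
GPUS_PER_NODE = 6
DDR4_GB = 512
HBM2_GB = 96
NVM_GB = 1600

# Assumptions not stated on this slide: 22 active cores per POWER9
# and 80 streaming multiprocessors (SMs) per V100.
CORES_PER_CPU = 22
SMS_PER_GPU = 80

total_gpus = NODES * GPUS_PER_NODE
total_memory_gb = NODES * (DDR4_GB + HBM2_GB + NVM_GB)
# Counting each CPU core and each GPU SM as one "core".
total_cores = NODES * (CPUS_PER_NODE * CORES_PER_CPU + GPUS_PER_NODE * SMS_PER_GPU)

print(total_gpus)             # 27648
print(total_memory_gb / 1e6)  # total memory in PB, just over 10
print(total_cores)            # 2414592
```

The DDR4 and HBM2 terms alone add up to 2,801,664 GB, which matches the memory figure listed for Summit in the comparison table.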
Summit Overview and Specifications
• Interconnect: Mellanox EDR 100G InfiniBand in a non-blocking fat-tree topology
• 25 GB/s between nodes
• In-network computing acceleration for communication frameworks such as MPI (Message Passing Interface)
• Peak power consumption: 13 MW
• Operating system: Red Hat Enterprise Linux (RHEL) version 7.4
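Two of these figures can be cross-checked with simple arithmetic. The sketch below assumes two EDR rails per node (the dual-rail setup described on the Interconnect slide) and ignores protocol overhead. It also estimates the power drawn during the HPL run from the Rmax and Green500 figures quoted in the Introduction:

```python
# Figures from the slides; two rails per node is taken from the Interconnect slide.
EDR_GBIT_PER_S = 100      # per-rail EDR InfiniBand line rate, Gb/s
RAILS_PER_NODE = 2
HPL_PFLOPS = 148.6
GREEN500_GFLOPS_PER_W = 14.719

# Node injection bandwidth in gigabytes per second (8 bits per byte),
# ignoring protocol overhead.
node_bandwidth_gbyte = EDR_GBIT_PER_S * RAILS_PER_NODE / 8

# Power implied by the HPL result and the Green500 efficiency, in megawatts.
hpl_power_mw = HPL_PFLOPS * 1e15 / (GREEN500_GFLOPS_PER_W * 1e9) / 1e6

print(node_bandwidth_gbyte)    # 25.0
print(round(hpl_power_mw, 1))  # 10.1
```

The implied HPL power of roughly 10 MW sits below the 13 MW peak consumption listed above.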
Summit Nodes
FIGURE 1: SUMMIT NODE BLOCK DIAGRAM
SOURCE: Summit, Oak Ridge National Laboratory (official web page), https://www.olcf.ornl.gov/summit/
IBM POWER9 Processor
• Each of Summit's POWER9 processors has 22 active cores out of the 24 on the die (4 hardware threads per core).
• PCI Express (Peripheral Component Interconnect Express) Gen4
• NVLink 2.0
• 14 nm FinFET semiconductor process with 8.0 billion transistors
• High-bandwidth signaling technology:
• 16 Gb/s interface for local SMP
• 25 Gb/s interface (25G Link) for accelerators and remote SMP
FIGURE 2: POWER9 ARCHITECTURE
SOURCE: S. K. Sadasivam, B. W. Thompto, R. Kalla and W. J. Starke, "IBM Power9 Processor Architecture," in IEEE Micro, vol. 37, no. 2, pp. 40-51, Mar.-Apr. 2017. doi: 10.1109/MM.2017.40
Core pipeline
• The microarchitecture has a shorter pipeline than POWER8.
• It removes the instruction grouping technique.
• It introduces new features that proactively avoid hazards in the load-store unit (LSU) and improve the LSU's execution efficiency.
• It can complete up to 128 instructions per cycle on an SMT8 core (64 on an SMT4 core).
• New lock management control improves performance.
FIGURE 3: POWER9 VS POWER8 PIPELINE STAGES
SOURCE: S. K. Sadasivam, B. W. Thompto, R. Kalla and W. J. Starke, "IBM Power9 Processor Architecture," in IEEE Micro, vol. 37, no. 2, pp. 40-51, Mar.-Apr. 2017. doi: 10.1109/MM.2017.40
Key components of Power9 core
Figure 4: SMT4 Core
Figure 5: SMT8 Core
Figure 6: Power9 SMT4 core. The detailed core block diagram shows all the key components of the Power9 core.
Cache Capacity of POWER9 Processor
• L1I: 32 KiB (per core, 8-way set associative)
• L1D: 32 KiB (per core, 8-way)
• L2: 512 KiB (per pair of cores)
• L3: 120 MiB eDRAM, 20-way
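The capacity and associativity figures determine how each cache is organized. A small sketch of that calculation, assuming 128-byte cache lines and an even split of the L3 across 12 core pairs (24 cores on the die). Neither assumption appears on the slide, and `num_sets` is a hypothetical helper:

```python
KIB = 1024
MIB = 1024 * KIB

# Assumption (not on the slide): 128-byte cache lines.
LINE_BYTES = 128


def num_sets(capacity_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache."""
    return capacity_bytes // (ways * line_bytes)


l1_sets = num_sets(32 * KIB, ways=8)

# Assumption: the 120 MiB L3 is split evenly across 12 core pairs.
l3_per_pair_mib = 120 // 12

print(l1_sets)          # 32
print(l3_per_pair_mib)  # 10
```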
FIGURE 7: SMT8 Cache
SOURCE: S. K. Sadasivam, B. W. Thompto, R. Kalla and W. J. Starke, "IBM Power9 Processor Architecture," in IEEE Micro, vol. 37, no. 2, pp. 40-51, Mar.-Apr. 2017. doi: 10.1109/MM.2017.40
NVIDIA Tesla V100 GPU Architecture
• The GPU is built with 21 billion transistors.
• Peak double-precision (FP64) performance is 7.8 TFLOP/s.
• Peak single-precision (FP32) performance is 15.7 TFLOP/s.
• The full GV100 chip has 5,376 FP32 cores, 5,376 INT32 cores, 2,688 FP64 cores, 672 Tensor Cores, and 336 texture units across 84 SMs; the Tesla V100 enables 80 of those SMs.
• Eight 512-bit memory controllers control access to the 16 GB of HBM2 memory.
• A 6 MB L2 cache is available to the SMs.
• NVIDIA's NVLink interconnect passes data between GPUs as well as between CPU and GPU.
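The peak figures above can be reproduced from the unit counts. The sketch below assumes 80 enabled SMs and a 1.53 GHz boost clock, neither of which is stated on this slide, and uses the per-SM core counts given on the Volta Streaming Multiprocessor slide. `peak_tflops` is a hypothetical helper:

```python
# Assumptions not stated on this slide: 80 enabled SMs and a 1.53 GHz boost clock.
SMS = 80
BOOST_GHZ = 1.53
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32
FLOPS_PER_FMA = 2  # a fused multiply-add counts as two floating-point operations


def peak_tflops(cores_per_sm: int) -> float:
    """Peak throughput in TFLOP/s for one class of execution unit."""
    return SMS * cores_per_sm * FLOPS_PER_FMA * BOOST_GHZ / 1000


print(round(peak_tflops(FP64_CORES_PER_SM), 1))  # 7.8
print(round(peak_tflops(FP32_CORES_PER_SM), 1))  # 15.7
```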
FIGURE 8: NVIDIA TESLA V100 GPU ARCHITECTURE
SOURCE: NVIDIA Tesla V100 GPU Architecture, white paper, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Volta Streaming Multiprocessor
• The new Streaming Multiprocessor (SM) architecture delivers major improvements in performance and energy efficiency.
• New mixed-precision Tensor Cores.
• 50% higher energy efficiency on general compute workloads than the previous generation.
• High-performance L1 data cache.
• Each V100 SM has 64 FP32 cores and 32 FP64 cores.
• Supports more threads, warps, and thread blocks than prior GPU generations.
• A 128 KB combined memory block for shared memory and L1 cache can be configured to allow up to 96 KB of shared memory.
• Each SM has four texture units, which also use the L1 cache.
FIGURE 9: VOLTA GV100 Streaming Multiprocessor (SM)
SOURCE: NVIDIA Tesla V100 GPU Architecture, white paper, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Tensor Cores
• The V100 GPU contains 640 Tensor Cores: eight per SM and two per processing block (partition) within an SM.
• Each Tensor Core performs 64 floating-point FMA (fused multiply-add) operations per clock.
• For deep learning training, Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 than on Pascal.
• For deep learning inference, Tensor Cores provide up to 6x higher peak TFLOPS on Tesla V100 than on Pascal.
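The Tensor Core counts translate into a peak mixed-precision rate per GPU and for the whole machine. This is a rough sketch assuming a 1.53 GHz boost clock (not stated on the slides) and counting each FMA as two operations. It gives a theoretical upper bound, not a measured result:

```python
# Figures from the slides: 640 Tensor Cores per V100, 64 FMAs per core per clock,
# and 27,648 GPUs in Summit.
TENSOR_CORES = 640
FMA_PER_CLOCK = 64
GPUS = 27648

# Assumption (not on the slides): 1.53 GHz boost clock.
BOOST_GHZ = 1.53
OPS_PER_FMA = 2  # one multiply plus one add

gpu_tensor_tflops = TENSOR_CORES * FMA_PER_CLOCK * OPS_PER_FMA * BOOST_GHZ / 1000
system_exaops = gpu_tensor_tflops * GPUS / 1e6

print(round(gpu_tensor_tflops, 1))  # 125.3
print(round(system_exaops, 2))      # 3.47
```

Under these assumptions the system-wide peak lands in the same range as the 3.3 exaops mixed-precision figure quoted in the Introduction.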
FIGURE 10: Pascal and Volta 4 x 4 matrix multiplication
SOURCE: NVIDIA Tesla V100 GPU Architecture, white paper, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Tensor cores
• Each Tensor Core operates on 4x4 matrices and performs the following operation:
• D = A×B + C, where A, B, C, and D are 4x4 matrices.
• Each FP16 multiply yields a full-precision product, which is then accumulated in FP32 to produce the result.
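The multiply-accumulate described above can be emulated in software to see where the rounding happens. The sketch below uses only the Python standard library. It rounds the inputs to FP16, forms each product exactly, and rounds the running sum to FP32 after every addition, which is a simplification and not a description of the actual hardware datapath. `to_fp16`, `to_fp32` and `tensor_core_mma` are hypothetical helper names:

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tensor_core_mma(a, b, c):
    """Emulate D = A x B + C on 4x4 matrices: FP16 inputs, FP32 accumulation."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32.
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                # Simplification: round the running sum to FP32 after each add.
                acc = to_fp32(acc + product)
            d[i][j] = acc
    return d


a = [[2.0] * 4 for _ in range(4)]
b = [[0.5] * 4 for _ in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = tensor_core_mma(a, b, c)
print(d[0])  # [5.0, 5.0, 5.0, 5.0]
```

Inputs that FP16 cannot represent exactly, such as 0.1, are rounded before the multiply, so the precision loss comes from the inputs and not from the accumulation.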
FIGURE 11: Tensor Core 4 x 4 Matrix Multiply and Accumulate
FIGURE 12: Mixed Precision Multiply and Accumulate in Tensor Core
Performance of Tensor Cores on Matrix Multiplications
FIGURE 13: Single precision (FP32)
FIGURE 14: Mixed precision
NVIDIA NVLink
• In Summit, the Tesla V100 accelerators and POWER9 CPUs are connected with NVLink.
• NVLink delivers higher performance than PCIe interconnects.
• Each link provides 25 GB/s in each direction.
FIGURE 15: NVIDIA NVLink
Interconnect
• Nodes are connected by a dual-rail Mellanox EDR (Enhanced Data Rate) 100 Gb/s InfiniBand network.
• Each node gets 25 GB/s of bandwidth.
• The same interconnect carries both storage and inter-process communication traffic.
• All nodes are interconnected in a non-blocking fat-tree topology.
• The fat tree is implemented as a three-level tree.
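To give a sense of scale, a non-blocking three-level fat tree built from identical k-port switches can attach at most k^3/4 end points. The sketch below applies that general formula under the assumption of 36-port switches. The slides do not state the switch radix or how the two rails map onto the tree, so this is an illustration and not Summit's actual cabling plan:

```python
def fat_tree_max_hosts(ports_per_switch: int) -> int:
    """Maximum end points of a non-blocking three-level fat tree built from
    identical switches with the given port count (k**3 / 4)."""
    k = ports_per_switch
    if k % 2:
        raise ValueError("port count must be even")
    return k ** 3 // 4


# Assumption: 36-port EDR switches; the slides do not state the switch size.
SWITCH_PORTS = 36
NODES = 4608
RAILS_PER_NODE = 2

capacity = fat_tree_max_hosts(SWITCH_PORTS)
endpoints = NODES * RAILS_PER_NODE
print(capacity)               # 11664
print(endpoints <= capacity)  # True
```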
FIGURE 16: ConnectX-5 adapter and interface with POWER9 chips
FIGURE 17: Fat Tree Topology
Application: Finding Drug Compounds to Fight the Coronavirus
• Summit was used to screen a library of 8,000 known, FDA-approved drug compounds for candidates to fight the coronavirus.
• It narrowed the set down to 77 compounds in just 2 days.
• Summit used the virus genome to search for a very specific type of drug compound.
• For comparison, Fugaku, now the world's fastest computer, was used to conduct molecule-level simulations:
• It narrowed 2,128 existing drugs down to 12 that bond easily to the virus proteins, in 10 days.
• Fugaku can perform more than 415 quadrillion computations per second, which is 2.8 times faster than Summit.
Comparison with other Supercomputers
| Rank | Rmax (PFLOPS) | Name | Model | Processor | Cores | Interconnect | Memory | Manufacturer | Operating system | Rpeak (PFLOPS) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 415.530 | Fugaku | Supercomputer Fugaku | A64FX 48C 2.2GHz | 7,299,072 | Tofu interconnect D | 4,866,048 GB | Fujitsu | Red Hat Enterprise Linux | 513.855 |
| 2 | 148.6 | Summit | IBM Power System AC922 | IBM POWER9 22C 3.07GHz | 2,414,592 | Dual-rail Mellanox EDR InfiniBand | 2,801,664 GB | IBM | RHEL 7.4 | 200.795 |
| 3 | 94.640 | Sierra | IBM Power System AC922 | IBM POWER9 22C 3.07GHz | 1,572,480 | Dual-rail Mellanox EDR InfiniBand | 1,382,400 GB | IBM | RHEL 7.4 | 125.712 |
| 4 | 93.014 | Sunway TaihuLight | Sunway MPP | Sunway SW26010 260C 1.45GHz | 10,649,600 | Sunway | 1,310,720 GB | NRCPC | Sunway RaiseOS 2.0.5 | 125.436 |
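The Rmax and Rpeak columns of the comparison table give each system's HPL efficiency, that is, the fraction of theoretical peak sustained on the benchmark. A short sketch using only the values from the table, with illustrative variable names:

```python
# (name, Rmax, Rpeak) in PFLOPS, copied from the comparison table.
systems = [
    ("Fugaku", 415.530, 513.855),
    ("Summit", 148.6, 200.795),
    ("Sierra", 94.640, 125.712),
    ("Sunway TaihuLight", 93.014, 125.436),
]

# HPL efficiency: sustained Linpack rate divided by theoretical peak.
efficiency = {name: rmax / rpeak for name, rmax, rpeak in systems}
for name, value in efficiency.items():
    print(f"{name}: {value:.1%}")

# Ratio of Fugaku's HPL result to Summit's.
fugaku_vs_summit = systems[0][1] / systems[1][1]
print(round(fugaku_vs_summit, 1))  # 2.8
```

All four systems sustain roughly three quarters or more of their theoretical peak, and the ratio agrees with the "2.8 times faster" figure quoted for Fugaku.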
Supercomputer development over the past 27 years
[Figure: photographs of the CM-5, Fugaku, Sunway TaihuLight, and Summit supercomputers]
• Thank you