Deep Learning Training At Scale
Spring Crest Deep Learning Accelerator
(Intel® Nervana™ NNP-T)
Andrew Yang | 8/16/2019
Legal Notices & Disclaimers
• This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest
forecast, schedule, specifications and roadmaps.
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system
can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to
evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
• Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances
will vary. Intel does not guarantee any costs or cost reduction.
• Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed
discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance. The products
described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
• Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804.
• All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• ​Performance results are based on testing as of August 1, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely
secure.
• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or
component can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/.
• Intel, the Intel logo, Intel Inside, Nervana, and others are trademarks of Intel Corporation in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others.
• © 2019 Intel Corporation.
• Real-world performance: time to train and power efficiency
• Compute used for the largest models & training sets doubles every 3.5 months**
• GEMMs and convolutions dominate the computation required for deep neural networks
• Primary factors that drive DL training accelerator design:
• Power
• Compute
• Memory & Communication
• Scale-out
Deep Learning Training
* Based on analysis of ResNet, FaceNet, Open NMT
[Figure: breakdown of total computation* (MAC operations >99%, other <1%), alongside a deep neural network diagram with input, hidden (1-4), and output layers]
**https://openai.com/blog/ai-and-compute/
• Train a network as fast as possible within a given
power budget, targeting larger models and datasets
• Balance between Compute, Communication, &
Memory
• Re-use on-die data as much as possible
• Optimize for batched workloads
• Built-in scale-out support
• Support future workloads
Spring Crest (NNP-T) Architecture Direction
• PCIe Gen 4 x16 EP
• 4x HBM2
• 64 lanes SerDes
• 24 Tensor Processors
• Up to 119 TOPS
• 60 MB on-chip
distributed memory
• Management CPU and
Interfaces
• 2.5D packaging
Spring Crest (NNP-T) SoC
[Figure: Spring Crest (NNP-T) SoC block diagram: a grid of 24 TPCs; four HBM stacks, each with its own PHY and memory controller; PCIe Gen 4 x16 with DMA; and inter-chip links (8x ICL) over x8 SerDes groups behind two inter-chip crossbars (X-bar)]
• TSMC CLN16FF+
• 680 mm² die, 1200 mm² interposer
• 27 billion transistors
• 60 mm x 60 mm, 6-2-6, 3325-pin BGA package
• 4x 8 GB HBM2-2400 memory
• Up to 1.1 GHz core frequency
• 64 lanes SerDes HSIO, up to 3.58 Tbps aggregate BW
• PCIe Gen 4 x16, SPI, I2C, GPIOs
• PCIe & OAM form factors
• Air-cooled, 150-250 W typical workload power
Spring Crest (NNP-T) Implementation
NNP-T Software Stack
• Full software stack built with open components
• Direct integration with DL frameworks
• nGraph: hardware-agnostic deep learning library and compiler for DL platform developers
• Provides common set of optimizations for NNP-T across
DL frameworks
• Argon: NNP-T DNN compute & communication
kernel library
• Low-level programmability: NNP-T kernel
development toolchain w/tensor compiler
[Figure: software stack diagram: open-source components (DL frameworks, nGraph) atop the Argon DNN kernel library, kernel-mode driver, and board/chip firmware]
• Flexible and programmable Tensor-based ISA
• Limited instruction set
• Extensible with uController custom instructions
• Same distributed programming model for both intra and
inter-chip
• Explicit SW memory management and message passing (sketched below)
• Synchronization primitives
• Compute has affinity to local data
• DL workloads are dominated by a limited set of operations
NNP-T Programming Model
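A minimal host-side sketch of the programming style these bullets describe: software, not a hardware cache, decides what lives in each TPC's local memory, and data moves by explicit messages. All names here (`TPC`, `alloc_local`, `send`) are invented for illustration; the real NNP-T toolchain API is not shown.

```python
# Hypothetical sketch only: illustrates explicit SW memory management and
# message passing; none of these names come from the real NNP-T toolchain.

class TPC:
    """Stand-in for one Tensor Processing Cluster."""
    def __init__(self, tpc_id, scratchpad_bytes=2_500_000):  # ~2.5 MB/TPC
        self.id = tpc_id
        self.free = scratchpad_bytes
        self.mem = {}                      # software-managed local memory

    def alloc_local(self, name, nbytes):
        # Software decides explicitly what lives on-die; no hardware caching.
        assert nbytes <= self.free, "scratchpad overflow: spill to HBM"
        self.free -= nbytes
        self.mem[name] = bytearray(nbytes)

    def send(self, dst, name):
        # Same message-passing primitive whether dst is on-die or on another
        # chip ("same distributed programming model for intra and inter-chip").
        dst.mem[name] = bytes(self.mem[name])

tpcs = [TPC(i) for i in range(24)]
for t in tpcs:                             # compute has affinity to local data:
    t.alloc_local("w_tile", 1 << 20)       # each TPC holds its own weight tile
tpcs[0].send(tpcs[1], "w_tile")            # explicit neighbor exchange
```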
Tensor Processing Cluster (TPC)
[Figure: TPC block diagram: a uController and control path drive two compute pipelines, each with pre-op stages, a 32x32 multiply array, a partial-product path, and post-ops; a convolution engine and local memory banks connect the pipelines to neighboring TPCs]
• Bfloat16 w/ FP32 accumulation (emulated in the sketch after the table below)
• No sacrifice in SOTA accuracy; improved power efficiency & training time*
• Minimal model changes
BFloat16 Numerics
Numeric format | Area/power efficient? | Easy to converge? | Notes
FP32 | No | Yes | Industry standard
FP16 | Yes | Medium | Good precision for Fprop; Bprop/update is more challenging
16b integer formats | Yes | No | Better area/power than FP, but hard to use
Bfloat16 | Yes | Yes | Good at Fprop, Bprop, and Update
8/4/2/1b integer | Extreme | Extremely difficult | Research areas
* Facebook and Intel joint paper - A Study of BFLOAT16 for Deep Learning Training: https://arxiv.org/pdf/1905.12322.pdf
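The numerics scheme is easy to emulate with NumPy, a sketch of which follows: bfloat16 is FP32 with the low 16 mantissa bits dropped, so it keeps FP32's 8-bit exponent range while products are accumulated in full FP32 (truncation is used below for brevity; real hardware typically rounds to nearest even).

```python
import numpy as np

def to_bfloat16(x):
    # Keep the top 16 bits of each FP32 value: same exponent range, 8-bit mantissa.
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return ((u >> 16) << 16).view(np.float32)

def bf16_matmul_fp32_acc(a, b):
    # Bfloat16 inputs, FP32 accumulation, as on the 32x32 multiply cores.
    return to_bfloat16(a) @ to_bfloat16(b)   # NumPy/BLAS accumulates in float32

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 256, 256)).astype(np.float32)
rel = np.abs(bf16_matmul_fp32_acc(a, b) - a @ b).max() / np.abs(a @ b).max()
print(f"max relative error vs FP32 GEMM: {rel:.2e}")   # small enough for training
```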
• Bfloat16 Matrix Multiply Core (32x32)
• FP32 & BF16 support for all other operations
• 2x multiply cores per TPC to amortize SoC
resources
• Vector operations for non-GEMM
• Compound pipeline
• DL specific optimizations
• Activation functions, RNG, Reductions &
accumulations
• Programmable FP32 look-up tables
Spring Crest Compute
[Figure: compound compute pipeline: A and B operands pass through pre-op stages into the 32x32 multiply array; partial products (PP) and element-wise (EW) inputs feed the post-op stage, producing Out0 and Out1]
• Four stacks of 8 GB HBM2-2400 devices
• 1.22 TBps raw bandwidth and 32 GB total device memory, ECC protected (see the arithmetic sketch below)
• 2.5 MB/TPC of local scratchpad memory
• 60 MB total distributed memory, ECC protected
• Native tensor transpose
• Simultaneous read and write on each MRB
• 1.4 TBps local read/write bandwidth per TPC
• Support for direct memory-to-memory transfer for both HBM and MRB
Memory Subsystem
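The headline numbers follow from simple arithmetic, sketched below; the 1024-bit-per-stack HBM2 interface width is a standard figure assumed here rather than stated on the slide.

```python
# Back-of-envelope check of the memory numbers above (assumes the standard
# 1024-bit HBM2 interface per stack).
stacks, mt_per_s, bus_bits = 4, 2400e6, 1024        # 4x HBM2-2400
raw_bw_tbps = stacks * mt_per_s * bus_bits / 8 / 1e12
print(raw_bw_tbps)             # 1.2288 -> the "1.22 TBps raw bandwidth"

print(stacks * 8, "GB HBM")    # 32 GB total device memory
print(24 * 2.5, "MB on-die")   # 24 TPCs x 2.5 MB = 60 MB distributed memory
```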
Spring Crest On-die Communication
• Bidirectional 2-D mesh architecture allows any-to-any communication
• Prioritized for throughput and congestion avoidance
• Cut-through forwarding and multicast support
• 2.6 TBps total cross-sectional BW, 1.3 TBps per direction
• All peripheral devices (HBM, SerDes) shared through the mesh
• Separate meshes for different traffic types
• Support for direct peer-to-peer communication between TPCs
[Figure: on-die mesh diagram: a grid of on-chip routers (OCR), one per TPC, linking the TPCs to four 8-channel HBM controllers, two inter-chip crossbars (X-BAR), and the PCIe interface]
• Run the largest models across multiple chips and across chassis
• 16 quads of 112 Gbps each; 3.58 Tbps total bidirectional BW per chip (see the arithmetic sketch below)
• Fully programmable router w/ multicast support enables multiple glueless topologies
• Reliable transmission
• Virtual channels and priorities for traffic management
• Direct low-latency local memory transfer
• Support for up to 1024 nodes
Scale-Out
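One consistent reading of the link numbers, sketched below as an assumption rather than a stated breakdown: 112 Gbps per quad (28 Gbps per lane across the 64 lanes) gives 1.79 Tbps per direction, i.e. the quoted 3.58 Tbps bidirectional total.

```python
# Reconciling the SerDes figures (an assumed breakdown, consistent with the
# 64-lane / 3.58 Tbps numbers quoted elsewhere in the deck).
quads, gbps_per_quad = 16, 112
per_direction = quads * gbps_per_quad       # 1792 Gbps each way
print(2 * per_direction / 1000)             # 3.584 -> "3.58 Tbps bidirectional"
print(gbps_per_quad / 4, "Gbps per lane")   # 28 Gbps x 64 lanes
```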
• DL workloads require a variety of GEMM sizes
• Minimize HBM memory-boundedness with on-die data re-use => higher utilization of compute resources (see the worked example below)
• Faster training, fewer idle resources
• ~2x better than the published utilization of competitive products
Spring Crest Single-Chip GEMM Performance
GEMM size | Spring Crest utilization
1024 x 700 x 512 | 31.1%
1760 x 7133 x 1760 | 44.5%
2048 x 7133 x 2048 | 46.7%
2560 x 7133 x 2560 | 57.1%
4096 x 7133 x 4096 | 57.4%
5124 x 9124 x 2048 | 55.5%
* All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Performance measured on July 10, 2019 on pre-production NNP-T Spring Crest silicon, using 22 TPCs at 900MHz core clock and 2GHz HBM clock. Host is an Intel® Xeon® Gold 6130T CPU @ 2.10GHz with 64 GB of system memory
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
** Other names and brands may be claimed as the property of others. DeepBench data from: https://github.com/baidu-research/DeepBench/blob/master/results/train/DeepBench_NV_V100.xlsx
Based on NVIDIA DGX-1, NVIDIA V100 GPU, Linux Kernel 4.4.0-124-generic, CUDA 10.0.130, CuDNN 7.3.1.20, NVIDIA Driver 410.48, Intel® Xeon® CPU E5-2698 v4 @ 2.2GHz
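Utilization here is achieved FLOP/s over peak. A sketch of the arithmetic follows; the peak for the test configuration (22 TPCs at 900 MHz, two 32x32 MAC arrays per TPC, 2 ops per MAC) is inferred from earlier slides and should be treated as an assumption.

```python
def gemm_utilization(m, n, k, seconds, peak_flops):
    return 2 * m * n * k / seconds / peak_flops   # 2 ops (mul + add) per MAC

# Peak for the measured config, inferred from earlier slides (assumption):
peak = 22 * 2 * 32 * 32 * 2 * 900e6               # ~81.1 TFLOP/s

# The table's 2048 x 7133 x 2048 entry at 46.7% utilization then implies:
t = 2 * 2048 * 7133 * 2048 / (0.467 * peak)
print(f"{t * 1e3:.2f} ms per GEMM")               # ~1.58 ms
```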
Description | Spring Crest utilization
c64xh56xw56_k64xr3xs3_st1_n128 | 86%
c128xh28xw28_k128xr3xs3_st1_n128 | 71%
c512xh28xw28_k128xr1xs1_st1_n128 | 65%
c128xh28xw28_k512xr1xs1_st1_n128 | 59%
c256xh14xw14_k1024xr1xs1_st1_n128 | 62%
c256xh28xw28_k512xr1xs1_st2_n128 | 71%
c32xh120xw120_k64xr5xs5_st1_n128 | 87%
Spring Crest Convolutions
• DL workloads require a variety of convolution hyperparameters
• Support for multiple tensor layouts maximizes on-die data reuse, resulting in higher compute efficiency
[Figure: bar chart (0-100% utilization) of convolution performance for the layers in the table]
C = # input channels, H = height, W = width, K = # filters, R = filter height, S = filter width, ST = stride, N = minibatch size (a small parser for these descriptors follows below)
* All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Performance measured on July 10, 2019 on pre-production NNP-T Spring Crest silicon, using 22 TPCs at 900MHz core clock and 2GHz HBM clock. Host is an Intel® Xeon® Gold 6130T CPU @ 2.10GHz with 64 GB of system memory
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
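The benchmark names encode the layer shape per the legend above. A small parser, as a sketch (the FLOP count assumes 'same' padding for illustration):

```python
import re

PAT = re.compile(r"c(\d+)xh(\d+)xw(\d+)_k(\d+)xr(\d+)xs(\d+)_st(\d+)_n(\d+)")

def parse(desc):
    c, h, w, k, r, s, st, n = map(int, PAT.match(desc).groups())
    return dict(C=c, H=h, W=w, K=k, R=r, S=s, ST=st, N=n)

cfg = parse("c64xh56xw56_k64xr3xs3_st1_n128")
# Direct-convolution FLOPs: 2 * N * K * C * R * S * H_out * W_out
h_out, w_out = cfg["H"] // cfg["ST"], cfg["W"] // cfg["ST"]  # 'same' padding assumed
flops = 2 * cfg["N"] * cfg["K"] * cfg["C"] * cfg["R"] * cfg["S"] * h_out * w_out
print(f"{flops / 1e9:.1f} GFLOPs per pass")        # ~29.6 GFLOPs for this layer
```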
• Benchmarked on a ring topology (intra- and inter-chassis); see the cost-model sketch after the table
• Support for different all-reduce algorithms with different communication patterns
Spring Crest Communication Performance
Communication kernels | Within-chassis BW (8 cards), Spring Crest 16x ICL | Cross-chassis BW (16, 32 cards), Spring Crest 16x ICL
2-card Send/Recv BW (GB/s) | 161 | 161
Allreduce BW (GB/s) | 151 | 151
Broadcast BW (GB/s) | 147 | 147
* All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Performance measured on July 10, 2019 on pre-production NNP-T Spring Crest silicon, using 22 TPCs at 900MHz core clock and 2GHz HBM clock. Host is an Intel® Xeon® Gold 6130T CPU @ 2.10GHz with 64 GB of system memory
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
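The flat bandwidth from 8 to 32 cards is what a bandwidth-optimal ring predicts: each card moves roughly 2(n-1)/n of the buffer regardless of n. Below is a sketch of the standard ring cost model, not NNP-T-specific code; the link rate is illustrative, not the measured figure.

```python
def ring_allreduce_time(nbytes, n, link_bw, hop_latency=0.0):
    steps = 2 * (n - 1)                    # reduce-scatter + all-gather
    return steps * (hop_latency + nbytes / n / link_bw)

for n in (8, 16, 32):
    t = ring_allreduce_time(128e6, n, link_bw=160e9)   # illustrative link rate
    print(n, f"{128e6 / t / 1e9:.1f} GB/s")            # 91.4, 85.3, 82.6: ~flat in n
```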
• Low overhead and direct memory transfer result in low latency
• High efficiency even at moderate transfer sizes (see the alpha-beta sketch after the tables)
• Cross-chassis scale-out with the same network/connectivity
Communication Performance
Allreduce latency, 2 KB message (ms) | Spring Crest, 16x ICL
2 cards (in-chassis) | 3
4 cards (in-chassis) | 8
8 cards (in-chassis) | 9
16 cards (cross-chassis) | 30
32 cards (cross-chassis) | 36

Allreduce data rate (GB/s):
Message size | Spring Crest (8 chips) | Spring Crest (32 chips)
1 MB | 68.7 | 39.9
8 MB | 115.8 | 92.2
32 MB | 137.5 | 130.2
128 MB | 147.1 | 147.4
* All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Performance measured on July 10, 2019 on pre-production NNP-T Spring Crest silicon, using 22 TPCs at 900MHz core clock and 2GHz HBM clock. Host is an Intel® Xeon® Gold 6130T CPU @ 2.10GHz with 64 GB of system memory
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
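The size sweep matches a simple alpha-beta model: a fixed startup cost plus bytes over a peak rate, so the data rate climbs toward the peak as messages grow. The constants below are illustrative, not fitted to the table.

```python
def xfer_rate_gbps(nbytes, alpha=20e-6, beta=150e9):
    # time = alpha (fixed overhead) + nbytes / beta (peak rate)
    return nbytes / (alpha + nbytes / beta) / 1e9

for mb in (1, 8, 32, 128):
    print(f"{mb:>4} MB -> {xfer_rate_gbps(mb * 1e6):5.1f} GB/s")
# 1 MB -> 37.5, 8 MB -> 109.1, 32 MB -> 137.1, 128 MB -> 146.6
# (qualitatively the same shape as the measured 32-chip column)
```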
• Domain-specific acceleration has a place in DL training
• Training time and model size continue to be bottlenecks
• Numerics and compute tailored for DL
• No legacy workloads to support
• Architected from the ground up to reduce data movement and keep compute units fed
• Higher utilization and efficiency on micro-benchmarks translate into better overall workload performance => faster, more power-efficient training
Concluding Remarks