Proprietary and confidential. Do not distribute.
Nervana and the Future of Computing
26 April 2016
Arjun Bansal
Co-founder & VP Algorithms, Nervana
MAKING MACHINES SMARTER.™
AI on demand using Deep Learning

[Diagram: the Nervana Platform provides deep learning (DL) services for image classification, object localization, video indexing, text analysis, and machine translation.]
Image classification and video activity detection

Deep learning model:
• Trained on a public dataset[1] of 13K videos in 100 categories
• Training was approximately 3x faster than a competing framework
• Can be extended to scene and object detection, action similarity labeling, video retrieval, and anomaly detection

Potential applications:
• Activity detection and monitoring for security
• Automatic editing of captured moments from video
• Facial recognition and image-based retrieval
• Sense-and-avoid systems for autonomous driving
• Baggage screening at airports and other public venues

[1] UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php
Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw
Object localization and recognition
Speech to text

Demo: https://youtu.be/NaqZkV_fBIM
Question answering

Story:
Mary journeyed to Texas. John went to Maryland.
Mary went to Iowa. John travelled to Florida.

Question: Where is John located?
Answer: Florida
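The toy bAbI-style task above can be answered by tracking each entity's most recent location. Below is a minimal hand-written baseline for exactly this pattern; Nervana's demo uses a learned memory-network model rather than rules, and all names here are illustrative:

```python
import re

def answer_where(story, question):
    """Track each entity's most recent location from simple movement
    sentences, then answer 'Where is X located?' questions."""
    locations = {}
    # Each sentence has the form "<Person> <verb> to <Place>."
    for sentence in re.split(r"\.\s*", story):
        m = re.match(
            r"(\w+) (?:went|journeyed|travelled|traveled|moved) to (\w+)",
            sentence)
        if m:
            locations[m.group(1)] = m.group(2)  # later sentences overwrite
    m = re.match(r"Where is (\w+) located\?", question)
    return locations.get(m.group(1)) if m else None

story = ("Mary journeyed to Texas. John went to Maryland. "
         "Mary went to Iowa. John travelled to Florida.")
print(answer_where(story, "Where is John located?"))  # Florida
print(answer_where(story, "Where is Mary located?"))  # Iowa
```

The point of the learned model is that it solves this without the hand-coded sentence patterns, and generalizes to question types this sketch cannot.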
Reinforcement learning

Pong and Breakout demos:
https://youtu.be/KkIf0Ok5GCE
https://youtu.be/0ZlgrQS3krg
Application areas

Healthcare, Agriculture, Finance, Online Services, Automotive, Energy
Nervana is building the future of computing
(The Economist, March 12, 2016)

• Cloud Computing
• Custom ASIC
• Deep Learning / AI
nervana cloud

[Diagram: data of many kinds (images, text, tabular, speech, time series, video) flows into the cloud, where models are imported, built, trained, and deployed.]
nervana neon

• Fastest library
• Model support
• Cloud integration
• Multiple backends
• Optimized at assembler level

Models: Convnet, RNN, LSTM, MLP, DQN, NTM
Domains: images, video, speech, text, time series
Backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU

Running locally:
% python rnn.py                      # or: neon rnn.yaml

Running in nervana cloud:
% ncloud submit --py rnn.py          # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>   # or use the REST API
nervana tensor processing unit (TPU)

• 10-100x gain
• Architecture optimized for:
  • Unprecedented compute density (1 nervana engine = 10 GPUs = 200 CPUs)
  • Scalable distributed architecture
  • Memory near computation
  • Learning and inference
  • Exploiting limited precision
  • Power efficiency

[Diagram: a conventional CPU couples its ALU and control logic to a shared instruction-and-data memory; the Nervana design places data memory directly alongside the compute units.]
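"Exploiting limited precision" refers to the observation that deep learning workloads tolerate low-precision arithmetic, which is what lets a fixed-function design pack far more multipliers into the same silicon and power budget. A rough illustration of the idea (the TPU's actual number format is not described in this deck): simulate an 8-bit integer dot product and compare it against the float result.

```python
def quantize(xs, scale=127.0):
    """Map floats in [-1, 1] to signed 8-bit integer codes."""
    return [max(-128, min(127, round(x * scale))) for x in xs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# A small dot product, computed in float and in simulated int8.
a = [0.12, -0.5, 0.33, 0.9, -0.07]
b = [0.4, 0.25, -0.6, 0.1, 0.8]

exact = dot(a, b)
# Integer dot product, rescaled back: (a*127) . (b*127) / 127^2
approx = dot(quantize(a), quantize(b)) / (127.0 ** 2)

print(exact, approx, abs(exact - approx))  # error is on the order of 1e-3
```

Hardware runs the inner loop entirely in narrow integer units and only rescales at the end; training schemes built on this typically add tricks such as stochastic rounding, which this sketch omits.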
Special purpose computation

1940s: Turing Bombe
Motivation: automating calculations, code breaking
General purpose computation

2000s: SoC
Motivation: reduce power and cost; fungible computing.
Enabled inexpensive mobile devices.
Dennard scaling has ended

What business and technology constraints do we have now?
Many-core tiled architectures

2010s: multi-core, GPGPU (Tilera TILEPro64, NVIDIA GM204, Intel Xeon Phi "Knights Landing")
Motivation: increased performance without clock rate increases or smaller devices.
Requires changes in programming paradigm.

[Excerpt from the TILEPro datasheet: the Tile Processor integrates external memory and I/O interfaces on chip, connected to the 64 tiles via the iMesh interconnect, which provides high-bandwidth, extremely low-latency communication among tiles. Each tile is a full-featured computing system that can independently run an operating system such as Linux: a 32-bit integer, three-way VLIW processor engine with its own program counter, cache, and DMA subsystem, executing up to three operations per cycle.]
FPGA architectures

Altera Arria 10
Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable.
Drawbacks: slow clock speed; lacks compute density for machine learning.
Neuromorphic architectures

IBM TrueNorth
[Excerpt from the TrueNorth paper: spikes leaving the on-chip mesh are tagged with their row or column before being merged onto a shared link; TrueNorth's power density is 20 mW per cm2, far below that of a typical CPU.]
Neural network parallelism

[Diagram: in data parallelism, data chunks 1..n are fed to processors 1..n, each holding the full deep network, with a parameter server coordinating parameters; in model parallelism, the network itself is partitioned across processors.]
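The data-parallel scheme in the diagram can be sketched in a few lines: each processor computes a gradient on its own data chunk, and the parameter server averages the gradients and updates the shared weights. Below is a toy serial simulation of that communication pattern (fitting y = 3x by gradient descent on two disjoint shards); it is illustrative, not Nervana's implementation:

```python
def local_gradient(w, shard):
    """Gradient of mean squared error of y = w*x on this worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parameter_server_step(w, shards, lr=0.05):
    """Each 'processor' computes a gradient on its data chunk; the
    parameter server averages them and updates the shared weight."""
    grads = [local_gradient(w, s) for s in shards]  # parallel in reality
    return w - lr * sum(grads) / len(grads)

# Data generated from y = 3x, split into two disjoint chunks.
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = 0.0
for _ in range(200):
    w = parameter_server_step(w, shards)
print(round(w, 3))  # converges to 3.0
```

The same loop structure applies to a deep network: `w` becomes the full parameter set replicated on every processor, and the averaging step becomes the parameter server's (bandwidth-heavy) job, which is exactly where interconnect topology starts to matter.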
Existing computing topologies are lacking

[Diagram: a typical deep learning server: two CPUs with an SSD, InfiniBand, and 10G Ethernet, plus four GPUs attached through PCIe switches. Scaling out replicates the same box, with inter-GPU communication funneled through PCIe switches and the network interfaces.]
nervana compute topology
22
CPU
CPU
S
S
D
IB
10
G
S
S
D
IB
10
G
nn
n n
nn
nn
PCIE SW
PCIE SW
Distributed linear algebra and convolution

SUMMA distributed matrix multiply, C = A*B (Jim Demmel, CS267 lecture notes), for an n x n matmul on a P^(1/2) x P^(1/2) processor grid:
• C[i,j] is the (n/P^(1/2)) x (n/P^(1/2)) submatrix of C on processor P_ij
• A[i,k] is an (n/P^(1/2)) x b submatrix of A
• B[k,j] is a b x (n/P^(1/2)) submatrix of B
• C[i,j] = C[i,j] + Σ_k A[i,k] * B[k,j], with the summation over submatrices
• The processor grid need not be square

From "Matrix multiplication on multidimensional torus networks" (Edgar Solomonik and James Demmel, UC Berkeley): blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure; a generalized 'Split-Dimensional' version of Cannon's algorithm (SD-Cannon) has a higher-dimensional, bidirectional communication structure, useful on torus interconnects that can achieve more injection bandwidth than single-link bandwidth.
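The blocked structure of SUMMA can be checked with a small serial simulation: each block C[i,j] accumulates A[i,k]*B[k,j] over the shared index k, exactly as in the notes above, with one broadcast step per k. A pure-Python sketch follows (no actual message passing; the 2x2 grid and 4x4 matrices are illustrative):

```python
def matmul(A, B):
    """Dense reference matmul on lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, i, j, b):
    """Extract the b x b block at block-coordinates (i, j)."""
    return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

def summa(A, B, grid=2):
    """Simulated SUMMA on a grid x grid 'processor' grid:
    C[i][j] += A[i][k] * B[k][j], summed over k. In a real run,
    A[i][k] is broadcast along processor row i and B[k][j] along
    processor column j at step k."""
    n = len(A)
    b = n // grid
    C = [[[[0] * b for _ in range(b)] for _ in range(grid)]
         for _ in range(grid)]
    for k in range(grid):  # one broadcast step per k
        for i in range(grid):
            for j in range(grid):
                C[i][j] = madd(C[i][j],
                               matmul(block(A, i, k, b), block(B, k, j, b)))
    # Stitch the distributed blocks back together for checking.
    return [sum((C[i][j][r] for j in range(grid)), [])
            for i in range(grid) for r in range(b)]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
assert summa(A, B) == matmul(A, B)
```

Each of the grid steps moves only one block row of A and one block column of B, which is why the communication pattern maps so naturally onto mesh and torus interconnects.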
Summary

• Computers are tools for solving the problems of their time
• Was: coding, calculation, graphics, the web
• Today: learning and inference on data
• Deep learning as a computational paradigm
• Custom architectures can do vastly better