Shinya Morino, Sr. Solution Architect, NVIDIA, 2/14/2020
QGATE 0.3:
QUANTUM CIRCUIT SIMULATOR
NVIDIA AI TECHNOLOGY CENTER (NVAITC)
Catalyse AI transformation through Research-Centric Integrated Engagements
Established Aug 2015 in Singapore
Collaboration Footprint: Singapore. ASEAN. Taiwan. China. Hong Kong. Australia. Europe.
Locations shown on map: Singapore (AP HQ), Taiwan, China, Australia, Hong Kong, Luxembourg, Thailand, London, Indonesia
QUANTUM COMPUTING
Qubit (Quantum bit):
- The basic unit of quantum computers.
- A qubit is represented as a linear superposition of two basis states, |0> and |1>:
  $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where $|\alpha|^2 + |\beta|^2 = 1$
- |0> or |1> is observed by measurement. The observation probabilities of |0> and |1> are $|\alpha|^2$ and $|\beta|^2$, respectively.
[Figure: qubit on the Bloch sphere, with amplitude $\cos\frac{\theta}{2}$ for |0> and $e^{i\phi}\sin\frac{\theta}{2}$ for |1>]
QUANTUM COMPUTING
Quantum circuits consist of qubits and quantum logic gates.
- With N qubits, $2^N$ states can be represented (if entangled).
- One quantum state corresponds to one complex number.
Ex. With 53 qubits, $2^{53}$ (about 10 peta) states can be represented.
Quantum states are controlled by using quantum logic gates.
- Applying one gate can change $2^N$ quantum states at the same time.
- Developing quantum circuits is the programming of quantum computing.
[Figure: quantum circuit with Hadamard (H) gates]
QUANTUM CIRCUIT SIMULATION
State vector
- Quantum states are expanded into a vector of complex numbers.
- The vector size is $2^N$ for N-qubit circuits.
- Each bit of the index corresponds to one qubit.
Quantum states and state vector
- Quantum states (complex numbers): $(s_0, s_1, s_2, \dots, s_{2^N-2}, s_{2^N-1})$
- Index of the state vector: $|0 \dots 00\rangle, |0 \dots 01\rangle, |0 \dots 10\rangle, \dots, |1 \dots 10\rangle, |1 \dots 11\rangle$, one bit per qubit (q0, q1, …, qN-1)
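The state vector layout above can be sketched in a few lines of plain Python. This is an illustration only, not Qgate code: the function names are hypothetical, and treating q0 as the least significant index bit is an assumption that the slide does not state.

```python
def new_state_vector(n_qubits):
    """Return the state vector of |0...00> as a list of 2**n_qubits complex numbers."""
    state = [0j] * (1 << n_qubits)
    state[0] = 1 + 0j
    return state


def qubit_values(index, n_qubits):
    """Decode a state-vector index into per-qubit bit values.

    Assumption: q0 is the least significant bit of the index.
    """
    return [(index >> q) & 1 for q in range(n_qubits)]


state = new_state_vector(3)
print(len(state))            # number of complex amplitudes for 3 qubits
print(qubit_values(6, 3))    # bit values of q0, q1, q2 for index 6
```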
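QUANTUM CIRCUIT SIMULATION
Quantum Logic Gate
Gate
- Represented as a 2x2 unitary matrix:
  $U = \begin{pmatrix} u_{00} & u_{01} \\ u_{10} & u_{11} \end{pmatrix}$
- Applying a quantum gate to a state vector:
  $\begin{pmatrix} s_{i+1,|\dots 0 \dots\rangle} \\ s_{i+1,|\dots 1 \dots\rangle} \end{pmatrix} = U \begin{pmatrix} s_{i,|\dots 0 \dots\rangle} \\ s_{i,|\dots 1 \dots\rangle} \end{pmatrix}$
Controlled gate (Control, Target)
- $U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & u_{00} & u_{01} \\ 0 & 0 & u_{10} & u_{11} \end{pmatrix}$
- The gate is applied when the control qubit is |1>:
  $\begin{pmatrix} s_{i+1,|\dots 1 \dots 0 \dots\rangle} \\ s_{i+1,|\dots 1 \dots 1 \dots\rangle} \end{pmatrix} = U \begin{pmatrix} s_{i,|\dots 1 \dots 0 \dots\rangle} \\ s_{i,|\dots 1 \dots 1 \dots\rangle} \end{pmatrix}$
- Controlled gates can make qubits entangled.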
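The update rule above can be sketched as a naive loop over amplitude pairs. This is a minimal illustration under assumptions, not Qgate's kernel: `apply_gate` is a hypothetical helper, and q0 is assumed to be the least significant index bit.

```python
import math


def apply_gate(state, u, target, control=None):
    """Apply the 2x2 matrix u to qubit `target`.

    If `control` is given, only amplitude pairs whose control bit is 1 are updated.
    """
    out = list(state)
    t = 1 << target
    for i in range(len(state)):
        if i & t:
            continue  # visit each pair once, from the member whose target bit is 0
        if control is not None and not (i >> control) & 1:
            continue  # control bit is 0: this pair is left unchanged
        s0, s1 = state[i], state[i | t]
        out[i] = u[0][0] * s0 + u[0][1] * s1
        out[i | t] = u[1][0] * s0 + u[1][1] * s1
    return out


h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]
X = [[0, 1], [1, 0]]

state = [1, 0, 0, 0]                          # |00>
state = apply_gate(state, H, 0)               # superposition on q0
state = apply_gate(state, X, 1, control=0)    # controlled-X entangles q0 and q1
print(state)
```

Every gate application walks the whole vector, which is why the later slides measure performance in terms of memory bandwidth.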
PROBLEM DEFINITION
A quantum circuit simulator is an essential tool for developing quantum circuits, but there are limitations:
It’s said …
“Number of qubits” is the limitation, because a vast amount of memory, proportional to $2^N$, is required for simulations.
But the actual issue as of today is:
“Simulation is very slow.”
Debugging and verifying quantum circuits takes a long time.
QUANTUM CIRCUIT EXAMPLES
Circuit                                # qubits  # gates  Capacity of state vector  Est. time, Python*1 (CPU 1 core)  Est. time, CPU*2 (multi-core)
Quantum Volume*3 (width 32, depth 32)  32        5,120    64 GB                     2 days                            3 hours
iQFT*4 (Ex: 32 qubits)                 32        560      64 GB                     3 hours                           13 min
Modulo operation (5^n mod 12)          27        5,449    2 GB                      45 min                            3 min
*1: Simulation with 1 CPU core.
*2: Assuming 55 GB/sec of CPU memory bandwidth with a naïve simulation algorithm.
*3: https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
*4: iQFT, Inverse Quantum Fourier Transform
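The "Capacity of state vector" column follows from the vector size. The sketch below assumes 16 bytes per amplitude (two FP64 values) and binary gigabytes; the helper name is hypothetical.

```python
GIB = 1 << 30  # binary gigabyte


def state_vector_bytes(n_qubits, bytes_per_complex=16):
    """Memory for 2**n_qubits amplitudes; 16 bytes per amplitude assumes FP64 complex."""
    return (1 << n_qubits) * bytes_per_complex


for n in (27, 32):
    print(n, "qubits:", state_vector_bytes(n) // GIB, "GiB")
```

With FP32 (8 bytes per amplitude) the same memory holds one more qubit.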
QUANTUM CIRCUIT SIMULATOR
QGATE
QGATE DESIGN CONCEPT
1. Easy development of quantum circuits with fast simulations for experiments
- Rich built-in gate set to quickly develop circuits
- Utilizing modern computing devices for performance
2. Single node, multi GPU (multi devices)
- Utilizing a big server with a huge amount of memory.
- Focusing on performance. No inter-node communication.
3. Works as a backend for other SDKs
- Simulations on Blueqat and various other SDKs can be accelerated.
1. EASY DEVELOPMENT OF QUANTUM CIRCUITS
Rich built-in gate set
- Multi-bit-controlled gates, such as the Toffoli gate, are included in the built-in gate set.
- Adjoint for all gates
All qubits are fully connected
IBM’s OpenQASM gate set is also supported
BUILT-IN OPERATORS
Quantum logic gate                       Symbol
Identity                                 I
Hadamard gate                            H
Pauli gates and their rotations          X, Y, Z, Rx(theta), Ry(theta), Rz(theta)
Exponential of identity and Pauli gates  Exp(I, X, Y, Z)
Global phase                             Exp(theta)
Phase shift gates                        P(theta), T, S
Measurement, Probability                 Measure(qubit), Prob(qubit)
Extensions
OpenQASM’s U gates                       U3, U2, U1
Multi qubit measurement                  Measure(pauli gates)
UTILIZING MODERN COMPUTING DEVICES FOR PERFORMANCE
GPU: Tesla V100 (SXM2)
- 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
- 32 GB HBM2 @ 900 GB/s | 300 GB/s NVLink
CPU: A CPU runtime is also implemented (utilizing multiple cores in one CPU socket).
TARGET HARDWARE
Requirement:
- Quantum circuit simulations need a huge amount of memory.
- Performance is important as well.
NVIDIA DGX-2
- 512 GB of GPU memory in 16 Tesla V100 GPUs
- By using NVLink, the memory of all GPUs is in one address space.
DGX-2
16 NVIDIA High-end GPUs + NVLink2
All GPUs share a single address space.
All-to-all connections by NVLink (300 GB/sec, bidirectional)
- 512 GB of ultra-fast memory is available
- FP32: 35 qubits, FP64: 34 qubits
NVIDIA DGX STATION
At a Glance
GPUs                         4x NVIDIA® Tesla® V100
TFLOPS (GPU FP16)            500
GPU Memory                   32 GB per GPU
NVIDIA Tensor Cores          2,560 (total)
NVIDIA CUDA Cores            20,480 (total)
CPU                          Intel Xeon E5-2698 v4 2.2 GHz (20-core)
System Memory                256 GB LRDIMM DDR4
Storage                      Data: 3 x 1.92 TB SSD RAID 0; OS: 1 x 1.92 TB SSD
Network                      Dual 10 Gb LAN
Display                      3x DisplayPort, 4K Resolution
Acoustics                    < 35 dB
Maximum Power Requirements   1500 W
Operating Temperature Range  10 - 30 °C
Software                     Ubuntu Desktop Linux OS, DGX Recommended GPU Driver, CUDA Toolkit
DGX STATION NVLINK NETWORK TOPOLOGY
For Efficient Application Scaling
NVIDIA NVLink Bridge
- Four NVIDIA Tesla V100 accelerators
- Each Tesla V100 GPU in DGX Station has four NVLink connection points, each providing a point-to-point connection to another GPU at a peak bandwidth of 25 GB/s
- Optimized for:
  - The bandwidth achievable for a variety of point-to-point and collective communications primitives
  - The flexibility of the topology
  - Performance with a subset of the GPUs
GPU REQUIREMENT
Qgate runs on a single GPU, and scales to multiple GPUs in a single node.
- Works with Kepler GPUs (compute capability 3.5) or later. Maxwell GPUs (compute capability 5.0) or later are recommended.
Multi GPU requirement
- NVLink: All-to-all NVLink connections between GPUs are required. For performance, NVLink is strongly recommended.
- PCIe: All GPUs should be connected to the same PCIe root complex.
CPU
- Running on 1 CPU socket is supported. NUMA is not taken into consideration.
PERFORMANCE MEASUREMENT
Baseline, Single GPU Performance
Quantum circuit for measurement
- 10 Hadamard gates are placed on each qubit.
- FP64 is used.
[Figure: circuit with 10 Hadamard gates on each qubit]
Device            Implementation
CPU (1 core)*1    Single thread on CPU
CPU (multi-core)  Multi-threaded*2 on CPU (40 threads, 20 physical cores)
GPU               GPU / CUDA
*1: CPU (1 core) models a Python-based simulator, which is sometimes implemented using 1 CPU core.
*2: Implemented using the C++ standard library’s thread class.
SUMMARY
Performance Baseline (30 qubits, Single GPU)
Device            # gates applied per sec.  Memory bandwidth  Acc. (vs. CPU 1 core)  Acc. (vs. CPU multi-core)
CPU (1 core)      0.11                      3.7 GB/sec        1                      -
CPU (multi-core)  1.59                      54.8 GB/sec       14.9x                  1
GPU               23.5                      806 GB/sec        220x                   14.7x
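The gate rate and memory bandwidth columns can be related with a rough model. The sketch assumes that each gate reads and writes every amplitude of the 30-qubit FP64 state vector once; the slides do not state this model, but under it the table's bandwidth figures are reproduced to within rounding. The function name is hypothetical.

```python
def effective_bandwidth_gb_per_s(gates_per_sec, n_qubits=30, bytes_per_complex=16):
    """Estimate memory traffic for a given gate rate.

    Assumption: one full read plus one full write of the state vector per gate.
    """
    vector_bytes = (1 << n_qubits) * bytes_per_complex
    bytes_per_gate = 2 * vector_bytes
    return gates_per_sec * bytes_per_gate / 1e9


for device, rate in [("CPU (1 core)", 0.11), ("CPU (multi-core)", 1.59), ("GPU", 23.5)]:
    print(device, round(effective_bandwidth_gb_per_s(rate), 1), "GB/sec")
```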
PROCESSING PIPELINE
(0.3 RELEASE)
PROCESSING PIPELINE
Built with Python and Native Extensions
Input (Intermediate repr.) -> Output (state vector)
Quantum computing specific
- Gate cancellation: Removing cancelling gates.
- Operator reordering: Reordering operators (gates and measurements) in order to maximize the effects of dynamic qubit grouping.
- Dynamic qubit grouping: Dynamically grouping qubits, reducing the number of variables required to represent quantum states.
Device specific
- Qubit reordering: Reordering qubits to reduce data transfer between devices.
- Runtime: Parallelization on computing devices, CPU (multi-core) and GPU (CUDA).
Implementation layers: Python, Native extension
SOFTWARE DIAGRAM
Frontend
- qgate.script: Circuit definition in Python
- qgate.openqasm: Importing OpenQASM files
- Plugin: Blueqat plugin, other plugins …
Backend (qgate)
- OM (object model): Analyses and optimizations for quantum circuits
  - qgate.model: Quantum circuit object model, built-in gate definitions
- qgate.simulator: Simulator
  - qgate.simulator.qubits: State vector, complex number, probability
- Runtime: Accelerating numerical operations
  - qgate.simulator.runtime
    - pyruntime: Python, reference
    - cpuruntime: CPU, multi-core
    - cudaruntime: CUDA, GPU
GATE CANCELLATION
Quantum Circuit Optimization
Products of some gate pairs cancel out:
$I = X \cdot X = Y \cdot Y = Z \cdot Z = H \cdot H$
[Figure: circuit in which adjacent pairs of X gates cancel out; U: arbitrary unitary gate]
Also works for pairs of Y, Z, H gates, whose squares are the identity.
Ex: Modulo arithmetic* (5^x mod 12, 27 qubits)
Item                 Value
Before cancellation  5449 gates
After cancellation   3885 gates
Reduction rate       71.3 % (gate count after cancellation relative to before, i.e. 28.7 % of gates removed)
*This circuit was developed by Kato-san in MDR.
Ref: V. Vedral, A. Barenco, A. Ekert, https://arxiv.org/abs/quant-ph/9511018v1
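A minimal sketch of the idea, not Qgate's implementation: the circuit format (gate name plus a tuple of qubits) and the function name are hypothetical, and only uncontrolled single-qubit X, Y, Z, H pairs are handled. Two identical gates cancel when no other gate touches their qubit in between.

```python
SELF_INVERSE = {"X", "Y", "Z", "H"}  # gates whose square is the identity


def cancel_gates(circuit):
    """Remove pairs of identical self-inverse single-qubit gates.

    `circuit` is a list of (name, qubits) tuples, applied in order.
    """
    out = []        # surviving gates; cancelled entries are set to None
    history = {}    # qubit -> stack of indices into `out` of gates touching it
    for name, qubits in circuit:
        if name in SELF_INVERSE and len(qubits) == 1:
            stack = history.setdefault(qubits[0], [])
            if stack and out[stack[-1]] == (name, qubits):
                out[stack.pop()] = None   # previous gate and this one cancel
                continue
        for q in qubits:
            history.setdefault(q, []).append(len(out))
        out.append((name, qubits))
    return [g for g in out if g is not None]


circuit = [("X", (0,)), ("H", (1,)), ("X", (0,)), ("CX", (0, 1)), ("H", (1,))]
print(cancel_gates(circuit))
```

In the usage example the two X gates on q0 cancel, while the two H gates on q1 do not, because the CX gate between them also acts on q1.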
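DYNAMIC QUBIT GROUPING
If qubits are not entangled,
- The state vector can be factorized.
- The number of variables is reduced.
3 qubit state vector, size: 8
$(s_0, s_1, s_2, s_3, s_4, s_5, s_6, s_7)$ for $|000\rangle, |001\rangle, |010\rangle, |011\rangle, |100\rangle, |101\rangle, |110\rangle, |111\rangle$
If 1 qubit is not entangled, the vector is the tensor product (⨂) of
- a 2-qubit vector $(s_{10}, s_{11}, s_{12}, s_{13})$ for $|00\rangle, |01\rangle, |10\rangle, |11\rangle$, and
- a 1-qubit vector $(s_{00}, s_{01})$ for $|0\rangle, |1\rangle$.
Size: 6 = (2 + 4)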
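The saving can be illustrated by storing the two factors and expanding them only when the full vector is needed. The amplitudes below are made-up example values, and `kron` is a hypothetical helper, not part of Qgate.

```python
def kron(a, b):
    """Kronecker product of two amplitude lists; `a` supplies the high-order index bits."""
    return [x * y for x in a for y in b]


one_qubit = [0.6, 0.8]              # 2 variables (example values)
two_qubit = [0.5, 0.5, 0.5, 0.5]    # 4 variables (example values)

full = kron(two_qubit, one_qubit)   # expanded 3-qubit state vector
print(len(one_qubit) + len(two_qubit), "stored variables instead of", len(full))
```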
DYNAMIC QUBIT GROUPING
30 qubit case
30 qubit state vector: $(s_0, s_1, s_2, \dots, s_{2^{30}-3}, s_{2^{30}-2}, s_{2^{30}-1})$
Size: $2^{30}$
If the qubits are divided into 10- and 20-qubit groups:
10-qubit vector $(s_0, s_1, \dots, s_{2^{10}-1})$ ⨂ 20-qubit vector $(s_0, s_1, s_2, \dots, s_{2^{20}-2}, s_{2^{20}-1})$
Size: $2^{20} + 2^{10}$ (about 0.1 % of $2^{30}$)
EX. INVERSE QUANTUM FOURIER TRANSFORM
[Figure: 5-qubit iQFT circuit built from H gates and R1 to R4 gates]
Qubits are grouped when a controlled gate is applied.
# Variables as the circuit proceeds:
10 ($2 \times 5$), 10 ($2^2 + 2 \times 3$), 12 ($2^3 + 2 \times 2$), 18 ($2^4 + 2$), 32 ($2^5$)
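The variable counts above can be reproduced with a small grouping sketch. It is an illustration under assumptions, not Qgate's data structure: groups are plain sets of qubit indices, and the order in which qubits join the group (each controlled gate links one more qubit to q0) is chosen only to match the counts on the slide.

```python
def count_variables(groups):
    """Each group of k qubits needs a state vector of 2**k variables."""
    return sum(2 ** len(g) for g in groups)


def apply_controlled_gate(groups, control, target):
    """Merge the groups holding `control` and `target`, since they may now be entangled."""
    gc = next(g for g in groups if control in g)
    gt = next(g for g in groups if target in g)
    if gc is gt:
        return groups
    return [g for g in groups if g is not gc and g is not gt] + [gc | gt]


groups = [frozenset([q]) for q in range(5)]   # 5 separate qubits
counts = [count_variables(groups)]
for control in range(1, 5):
    groups = apply_controlled_gate(groups, control, 0)
    counts.append(count_variables(groups))
print(counts)
```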
EFFECTS OF DYNAMIC QUBIT GROUPING
Calculation amount reduced by applying qubit grouping.
[Figure: iQFT, numerical estimation. Horizontal axis: # qubits (0 to 32). Calculation amount (log axis, 1.0E+00 to 1.0E+12) without and with qubit grouping, and the reduction ratio (ratio of calculation amount, qubit grouping enabled / disabled, 0 to 1). Annotation: 12.1 % at 30 qubits.]
CALCULATION AMOUNT COMPARISON
CUDA/CPU/Theoretical
In the range where # qubits is small,
- Processing overheads are observed.
In the range where # qubits is big,
- Computation time is long enough that the overhead is relatively small.
- Estimation and measurement matched.
Observed overhead
- Time for analyzing the quantum circuit
- Managing grouped state vectors
[Figure: reduction ratio (0 to 1.4) versus # qubits (8 to 32) for CUDA, CPU (multi core), and Theoretical. Annotations: "Processing overheads observed" and "Performance improved as expected".]
OPERATOR REORDERING
Maximizing effects of dynamic qubit grouping
- Reordering operators so that they are applied to a smaller qubit group
- Reducing the amount of calculation.
[Figure: circuit with gates U0 to U4, before and after operator reordering]
BENCHMARK
Phase Estimation
One of the most important algorithms of quantum computing
- Shor’s algorithm
  Used for the order-finding problem (https://en.wikipedia.org/wiki/Shor%27s_algorithm)
- Quantum chemistry
  Used for obtaining matrix eigenvalues
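PHASE ESTIMATION
Without Operator Reordering
[Figure: phase estimation circuit with U^16, U^8, U^4, U^2, U gates followed by the iQFT (H and R1 to R4 gates), in the original operator order]
PHASE ESTIMATION
Operators are Reordered
[Figure: the same phase estimation circuit after operator reordering]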
PHASE ESTIMATION
Benchmark
30 qubit circuit, 493 gates, FP64
- Measuring global phase of one qubit.
- 29 qubits are used for measurements.
- Running on a single Tesla V100 (32 GB)
[Figure: circuit applying $\exp(i 2^{n-1}\theta)$, $\exp(i 2^{n-2}\theta)$, …, $\exp(i\theta)$ with 29 measurement qubits, followed by the iQFT]
AN EXAMPLE OF CALCULATION RESULTS
1024 shots of sampling.
The initial value is 0.1.
Raw sampling results (measured value, count):
(0.09999997168779373, 1)
(0.09999998286366463, 1)
(0.09999998472630978, 1)
(0.09999999031424522, 1)
(0.09999999217689037, 1)
(0.09999999403953552, 4)
(0.09999999590218067, 4)
(0.09999999776482582, 26)
(0.09999999962747097, 900)
(0.10000000149011612, 57)
(0.10000000335276127, 17)
(0.10000000521540642, 7)
(0.10000000707805157, 1)
(0.10000000894069672, 1)
(0.10000001080334187, 1)
(0.10000001639127731, 1)
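The spacing of the sampled values can be checked against the number of measurement qubits. The sketch assumes that each measured value is a multiple of 2^-29, one step per bit pattern of the 29 measurement qubits; this is consistent with the listed values but is not stated on the slide. The names are hypothetical.

```python
N_MEASUREMENT_QUBITS = 29

# Smallest step between two representable results (assumption: values are k / 2**29)
resolution = 1 / 2 ** N_MEASUREMENT_QUBITS


def nearest_grid_value(phase, n_bits=N_MEASUREMENT_QUBITS):
    """Closest value of the form k / 2**n_bits to `phase`."""
    return round(phase * 2 ** n_bits) / 2 ** n_bits


print(resolution)                 # step size between neighbouring results
print(nearest_grid_value(0.1))    # grid value closest to the initial value 0.1
```

Under this assumption, the most frequent result in the list (900 of 1024 shots) is the grid value closest to 0.1.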
PHASE ESTIMATION
Operator Reordering, Single GPU
30 qubit circuit, 493 gates, FP64
- Measuring global phase of one qubit.
- 29 qubits are used for measurements.
Runtime / optimization   Elapsed time [s]  Acceleration
CPU / no optimization    213               1
CPU / optimized          24.7              8.6x
CUDA / no optimization   13.7              15.5x
CUDA / optimized         1.86              114x
MULTI GPU + NVLINK
IDEAL MULTI GPU PERFORMANCE
Performance Baseline (30 qubits, Single GPU)
Device            # gates applied per sec.  Memory bandwidth  Acc. (vs. CPU 1 core)  Acc. (vs. CPU multi-core)
CPU (1 core)      0.11                      3.7 GB/sec        1                      -
CPU (multi-core)  1.59                      54.8 GB/sec       14.9x                  1
GPU               23.5                      806 GB/sec        220x                   14.7x
58.8 = 14.7 x 4 GPUs (DGX Station)
BOTTLENECK : DATA TRANSFER
Ex. DGX Station
NVLink is fast, but slower than GPU memory.
[Figure: four GPUs, each with 900 GB/s of GPU memory bandwidth, connected by NVLink at 50 GB/s or 100 GB/s]
Bandwidth
GPU                             900 GB/s
NVLink (1 link, bidirectional)  50 GB/s
QUBIT REORDERING
Multi GPU, Reducing Data Transfers
Ex) 6 qubits, q0 to q5
- Gates on q0 ~ q3 are applied within each GPU.
- When q4 or q5 is included in the target qubits, data transfers between GPUs happen for each gate application.
Ref: 0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit, Thomas Häner, Damian S. Steiger, https://arxiv.org/abs/1704.01127
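Which qubits are local can be sketched as follows. The sketch assumes the state vector is split into equal contiguous chunks, one per GPU, so that the highest index bits select the GPU, and it assumes 4 GPUs, which matches the q0 ~ q3 / q4, q5 split of the 6-qubit example. The function name is hypothetical.

```python
def is_gate_local(target_qubit, n_qubits, n_gpus):
    """True if a single-qubit gate on `target_qubit` pairs amplitudes stored on the same GPU.

    Assumptions: n_gpus is a power of two and the highest index bits select the GPU.
    """
    gpu_bits = n_gpus.bit_length() - 1     # index bits that select the GPU
    local_qubits = n_qubits - gpu_bits     # low index bits address memory inside one GPU
    return target_qubit < local_qubits


print([is_gate_local(q, 6, 4) for q in range(6)])
```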
QUBIT REORDERING
Multi GPU, Reducing Data Transfers
Reordering qubits
- Swapping q0 ~ q2 and q3 ~ q5.
- All required inter-device communications are done while reordering qubits.
- All gates are applied within each GPU.
Ex)
[Figure: gates are applied in each GPU on q0 to q5; the qubits are then reordered to q3, q4, q5, q0, q1, q2, which is the only point where data transfers between GPUs happen; the remaining gates are again applied in each GPU]
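Reordering qubits amounts to permuting the bits of every state-vector index. The sketch below shows this on a single list in plain Python; it is an illustration under assumptions (hypothetical function name, q0 as the least significant index bit) and does not model the actual inter-GPU copies.

```python
def reorder_qubits(state, perm):
    """Return a state vector in which new qubit j holds what old qubit perm[j] held."""
    n = len(perm)
    out = [0j] * len(state)
    for old_index, amplitude in enumerate(state):
        new_index = 0
        for j in range(n):
            new_index |= ((old_index >> perm[j]) & 1) << j
        out[new_index] = amplitude
    return out


state = [complex(i) for i in range(64)]                  # 6 qubits, amplitudes tagged by index
swapped = reorder_qubits(state, [3, 4, 5, 0, 1, 2])      # swap q0 ~ q2 with q3 ~ q5
print(swapped[0b000111])                                 # amplitude that was at index 0b111000
```

After the swap, gates that targeted the former high qubits act on low index bits, so they can be applied without further transfers until the next reordering.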
BENCHMARK
Quantum Volume (n=32, d=32), FP64, DGX Station (4 GPUs)
https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
32 qubit circuit, 5120 gates, FP64
Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4
Runtime             Optimization                            Elapsed time  Acc.
CPU                 No optimization                         3.1 hours     -
CUDA, 4 Tesla V100  No optimization                         370 sec       29.7 x
CUDA, 4 Tesla V100  + Qubit reordering*                     318 sec       56.7 x
CUDA, 4 Tesla V100  + Qubit grouping + Operator reordering  176 sec       62.5 x
*: Qubits are reordered 10 times during execution of the whole circuit.
BENCHMARK
Phase estimation, 32 qubit circuit
32 qubit circuit, 558 gates, FP64
Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4
Runtime             Optimization                            Elapsed time  Acc.
CPU                 No optimization                         774 sec       -
CUDA, 4 Tesla V100  No optimization                         18.4 sec      42 x
CUDA, 4 Tesla V100  + Qubit reordering*                     15.4 sec      50 x
CUDA, 4 Tesla V100  + Qubit grouping + Operator reordering  3.2 sec       242 x
*: Qubits are reordered 8 times during execution of the whole circuit.
PLANS FOR THE NEXT VERSION
• Supporting hyper-cube-mesh topology.
• Fully utilizing 8 GPUs on servers such as DGX-1 and the AWS p3dn.24xlarge instance.
• Enabling 33 qubit circuits to run (float64).
• Acceleration of GPU kernels.
• Qgate 0.3 implements naïve GPU kernels to apply gates, which are not optimized yet.
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR

More Related Content

PDF
計算力学シミュレーションに GPU は役立つのか?
PDF
Profiling deep learning network using NVIDIA nsight systems
PDF
Jetson AGX Xavier and the New Era of Autonomous Machines
PDF
RAPIDS Overview
PDF
Volta (Tesla V100) の紹介
PPTX
Schematic diagrams of GPUs' architecture and Time evolution of theoretical FL...
PDF
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
PDF
Breaking New Frontiers in Robotics and Edge Computing with AI
計算力学シミュレーションに GPU は役立つのか?
Profiling deep learning network using NVIDIA nsight systems
Jetson AGX Xavier and the New Era of Autonomous Machines
RAPIDS Overview
Volta (Tesla V100) の紹介
Schematic diagrams of GPUs' architecture and Time evolution of theoretical FL...
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
Breaking New Frontiers in Robotics and Edge Computing with AI

What's hot (18)

PPTX
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
PDF
GTC 2018 で発表された自動運転最新情報
PPTX
Dalvik Vm &amp; Jit
PPTX
Intro to GPGPU Programming with Cuda
PDF
NVIDIA PRO VR DAY 2017 基調講演
PDF
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
PPT
Vpu technology &gpgpu computing
PDF
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PDF
CuPy: A NumPy-compatible Library for GPU
PDF
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
 
PDF
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
PDF
Part 3 Maximizing the utilization of GPU resources on-premise and in the cloud
PDF
numPYNQ: accelerating NumPy on PYNQ
PDF
Early Benchmarking Results for Neuromorphic Computing
PDF
Cuda tutorial
PPTX
Intro to GPGPU with CUDA (DevLink)
PDF
車載組み込み用ディープラーニング・エンジン NVIDIA DRIVE PX
PDF
PG-Strom - GPU Accelerated Asyncr
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
GTC 2018 で発表された自動運転最新情報
Dalvik Vm &amp; Jit
Intro to GPGPU Programming with Cuda
NVIDIA PRO VR DAY 2017 基調講演
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Vpu technology &gpgpu computing
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
CuPy: A NumPy-compatible Library for GPU
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
 
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Part 3 Maximizing the utilization of GPU resources on-premise and in the cloud
numPYNQ: accelerating NumPy on PYNQ
Early Benchmarking Results for Neuromorphic Computing
Cuda tutorial
Intro to GPGPU with CUDA (DevLink)
車載組み込み用ディープラーニング・エンジン NVIDIA DRIVE PX
PG-Strom - GPU Accelerated Asyncr
Ad

Similar to QGATE 0.3: QUANTUM CIRCUIT SIMULATOR (20)

PPTX
Lecture_2_v2_qc.pptx
PDF
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
PDF
QX Simulator and quantum programming - 2020-04-28
PPTX
Lunch session: Quantum Computing
PPTX
What is Quantum Computing and Why it is Important
PDF
Quantum Computing and Qiskit
PDF
Virus, Vaccines, Genes and Quantum - 2020-06-18
PDF
Quantum computing for CS students: open source software
PDF
The 1st workshop on engineering processes and practices for quantum software ...
PDF
Programming quantum computers in Q# (Techorama NL 2018)
PPTX
Meetup web scale architecture quantum computing (Part 1 16-10-2018)
PDF
Quantum Computing and Java QC API—Strange
PDF
2024-11-05 - KAIST guest lecture - Aritra Sarkar
PDF
開発者が語る NVIDIA cuQuantum SDK
PDF
Full stack component of software and middleware for quantum machine
PPTX
Quantum Computing Fundamentals via OO
PDF
Quantum Computing with Amazon Braket
PDF
Search and optimization on quantum accelerators - 2019-05-23
PDF
Quantum Computing Notes Ver1.0
PPTX
Ph.D. Defense
Lecture_2_v2_qc.pptx
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
QX Simulator and quantum programming - 2020-04-28
Lunch session: Quantum Computing
What is Quantum Computing and Why it is Important
Quantum Computing and Qiskit
Virus, Vaccines, Genes and Quantum - 2020-06-18
Quantum computing for CS students: open source software
The 1st workshop on engineering processes and practices for quantum software ...
Programming quantum computers in Q# (Techorama NL 2018)
Meetup web scale architecture quantum computing (Part 1 16-10-2018)
Quantum Computing and Java QC API—Strange
2024-11-05 - KAIST guest lecture - Aritra Sarkar
開発者が語る NVIDIA cuQuantum SDK
Full stack component of software and middleware for quantum machine
Quantum Computing Fundamentals via OO
Quantum Computing with Amazon Braket
Search and optimization on quantum accelerators - 2019-05-23
Quantum Computing Notes Ver1.0
Ph.D. Defense
Ad

More from NVIDIA Japan (20)

PDF
HPC 的に H100 は魅力的な GPU なのか?
PDF
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
PDF
Physics-ML のためのフレームワーク NVIDIA Modulus 最新事情
PDF
20221021_JP5.0.2-Webinar-JP_Final.pdf
PDF
NVIDIA Modulus: Physics ML 開発のためのフレームワーク
PDF
NVIDIA HPC ソフトウエア斜め読み
PDF
HPC+AI ってよく聞くけど結局なんなの
PDF
Magnum IO GPUDirect Storage 最新情報
PDF
データ爆発時代のネットワークインフラ
PDF
Hopper アーキテクチャで、変わること、変わらないこと
PDF
GPU と PYTHON と、それから最近の NVIDIA
PDF
GTC November 2021 – テレコム関連アップデート サマリー
PDF
テレコムのビッグデータ解析 & AI サイバーセキュリティ
PDF
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
PDF
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
PDF
2020年10月29日 Jetson活用によるAI教育
PDF
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
PDF
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
PDF
Jetson Xavier NX クラウドネイティブをエッジに
PDF
GTC 2020 発表内容まとめ
HPC 的に H100 は魅力的な GPU なのか?
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
Physics-ML のためのフレームワーク NVIDIA Modulus 最新事情
20221021_JP5.0.2-Webinar-JP_Final.pdf
NVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA HPC ソフトウエア斜め読み
HPC+AI ってよく聞くけど結局なんなの
Magnum IO GPUDirect Storage 最新情報
データ爆発時代のネットワークインフラ
Hopper アーキテクチャで、変わること、変わらないこと
GPU と PYTHON と、それから最近の NVIDIA
GTC November 2021 – テレコム関連アップデート サマリー
テレコムのビッグデータ解析 & AI サイバーセキュリティ
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
2020年10月29日 Jetson活用によるAI教育
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
Jetson Xavier NX クラウドネイティブをエッジに
GTC 2020 発表内容まとめ

Recently uploaded (20)

PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Modernizing your data center with Dell and AMD
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Machine learning based COVID-19 study performance prediction
PPTX
Cloud computing and distributed systems.
PDF
KodekX | Application Modernization Development
PPTX
A Presentation on Artificial Intelligence
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Understanding_Digital_Forensics_Presentation.pptx
Chapter 3 Spatial Domain Image Processing.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
The AUB Centre for AI in Media Proposal.docx
Modernizing your data center with Dell and AMD
Digital-Transformation-Roadmap-for-Companies.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Machine learning based COVID-19 study performance prediction
Cloud computing and distributed systems.
KodekX | Application Modernization Development
A Presentation on Artificial Intelligence
Unlocking AI with Model Context Protocol (MCP)
Mobile App Security Testing_ A Comprehensive Guide.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Understanding_Digital_Forensics_Presentation.pptx

QGATE 0.3: QUANTUM CIRCUIT SIMULATOR

  • 1. Shinya Morino, Sr. Solution Architect, NVIDIA, 2/14/2020 QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
  • 2. 2 NVIDIA AI TECHNOLOGY CENTER (NVAITC) Catalyse AI transformation through Research-Centric Integrated Engagements Singapore (AP HQ) Taiwan China Australia Hong Kong Luxembourg Established Aug 2015 in Singapore Collaboration Footprint: Singapore. ASEAN. Taiwan. China. Hong Kong. Australia. Europe. Thailand London Indonesia
  • 3. 3 QUANTUM COMPUTING Qubit (Quantum bit): - The basic unit of quantum computers. - Qubits are represented as a linear superposition of two basis states, |0> and |1>. ۧ|𝜓 = 𝛼 ۧ|0 + 𝛽 ۧ|1 𝛼 2 + 𝛽 2 = 1 - |0> or|1> is observed by measurement. Observation probabilities of |0> and |1> are 𝛼 2 and 𝛽 2 respectively. Qubit ۧ|0 = cos 𝜃 2 , ۧ|1 = 𝑒 𝑖𝜙 sin 𝜃 2
  • 4. 4 QUANTUM COMPUTING Quantum circuits consist with qubits and quantum logic gates. - With N qubits, 2N states can be represented (if entangled). - One quantum state corresponds to one complex number. Ex. With 53 qubits, 253 ( 10 Peta) states can be represented. Quantum states are controlled by using quantum logic gates. - Applying one gate can change 2N qubit states at the same time. - Developing quantum circuits is the programming for quantum computing Quantum circuit H H H H
  • 5. 5 QUANTUM CIRCUIT SIMULATION State vector - Quantum states are expended to a vector of complex numbers - Vector size is 2N for N-qubit circuits. - Each bit in index is corresponding to one qubit. Quantum states and state vector 𝑠0 𝑠1 𝑠2 ⋮ 𝑠2 𝑁 −2 𝑠2 𝑁 −1 ۧ|0 … 00 ۧ|0 … 01 ۧ|0 … 10 ⋮ ۧ|1 … 10 ۧ|1 … 11 index of state vector Quantum state (complex number) q0q1qN-1 … Qubits
  • 6. 6 Represented as a 2x2 unitary matrix Applying quantum gate to a state vector. QUANTUM CIRCUIT SIMULATION Quantum Logic Gate U 𝑈 = 𝑢00 𝑢01 𝑢10 𝑢11 𝑠𝑖+1,| ۧ…𝟎… 𝑠𝑖+1,| ۧ…𝟏… = 𝑈 𝑠𝑖,| ۧ…𝟎… 𝑠𝑖,| ۧ…𝟏… Gate U = 1 0 0 1 0 0 0 0 0 0 0 0 u00 u01 u10 u11 U Control Target Gate is applied when controlling gbit is |1>. Control gates can make qubits entangled. 𝑠𝑖+1,| ۧ…𝟏…𝟎… 𝑠𝑖+1,| ۧ…𝟏…𝟏… = 𝑈 𝑠𝑖,| ۧ…𝟏…𝟎… 𝑠𝑖,| ۧ…𝟏…𝟏…
  • 7. 7 It’s said … “Number of qubits” is the limitation, because vast amount of memory proportional to 2N, is required for simulations. PROBLEM DEFINITION Quantum circuit simulator is an essential tool to develop quantum circuits, but there’re limitations: But actual issue as of today is: “Simulation is very slow.” Needing long time for debugging and verifying quantum circuits
  • 8. 8 QUANTUM CIRCUIT EXAMPLES Circuit # qubits # gates Capacity of State vector Estimated simulation time Python*1 (CPU 1core) CPU*2 (multi-core) Quantum Volume*3 (width 32, depth 32) 32 5,120 64 GB 2 days 3 hours iQFT *4 (Ex: 32 qubits) 32 560 64 GB 3 hours 13 min Modulo operation ( 5n mod 12 ) 27 5,449 2 GB 45 min 3 min *1: Simulation with 1 cpu core. *2: Assuming 55 GB/sec of CPU memory bandwidth with naïve simulation algorithm. *3: https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm, *4: iQFT, Inversed Quantum Fourier Transform,
  • 10. 11 QGATE DESIGN CONCEPT 1. Easy development of quantum circuits with fast simulations for experiments Rich built-in gate set to quickly develop circuits Utilizing modern computing devices for performance 2. Single node, Multi GPU (multi devices) Utilizing a big server with a huge amount of memory. Focusing on performance. No intra-node communication. 3. Works as backends of other SDKs Simulations can be accelerated on Blueqat, various SDKs.
  • 11. 12 1. EASY DEVELOPMENT OF QUANTUM CIRCUITS Rich built-in gate set - Multi-bit-controlled gates, such as Toffoli gate is included in built-in gate set - Adjoint for all gates All qubits are fully connected IBM’s OpenQASM gate set is also supported
  • 12. 13 BUILT-IN OPERATORS Quantum logic gate Symbol Identity I Hadamard gate H Pauli gates and their rotations X, Y, Z, Rx(theta), Ry(theta), Rz(theta) Exponential of identity and Pauli gates Exp(I, X, Y, Z) Global phase Exp(theta) Phase shift gates P(theta), T, S Measurement, Probability Measure(qubit), Prob(qubit) Extensions OpenQASM’s U gates U3, U2, U1 Multi qubit measurement Measure(pauli gates)
  • 13. 14 UTILIZING MODERN COMPUTING DEVICES FOR PERFORMANCE Tesla V100 (SXM2) 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS 32 GB HBM2 @ 900GB/s | 300GB/s NVLink GPU CPU CPU runtime is also implemented. (Utilizing multi cores in one CPU socket)
  • 14. 15 TARGET HARDWARE Requirement: - Quantum circuit simulations need a huge amount of memory - Performance is important as well. DGX-2 - 512 GB of GPU memory in 16 Tesla V100 - By using NVLink, all memories in GPUs are in one address space. NVIDIA DGX-2
  • 15. 16 DGX-2 All GPUs are sharing a single address space. All-to-all connections by NVLink (300 GB/sec, bidirectional) - 512 GB of ultra-fast memory is available - FP32: 35 qubits FP64: 34 qubits 16 NVIDIA High-end GPUs + NVLink2
  • 16. 17 At a Glance GPUs 4x NVIDIA® Tesla® V100 TFLOPS (GPU FP16) 500 GPU Memory 32 GB per GPU NVIDIA Tensor Cores 2,560 (total) NVIDIA CUDA Cores 20,480 (total) CPU Intel Xeon E5-2698 v4 2.2 GHz (20-core) System Memory 256 GB LRDIMM DDR4 Storage Data: 3 x 1.92 TB SSD RAID 0 OS: 1 x 1.92 TB SSD Network Dual 10 Gb LAN Display 3x DisplayPort, 4K Resolution Acoustics < 35 dB Maximum Power Requirements 1500 W Operating Temperature Range 10 - 30 oC Software Ubuntu Desktop Linux OS DGX Recommended GPU Driver CUDA Toolkit 17 NVIDIA DGX STATION
  • 17. 18 DGX STATION NVLINK NETWORK TOPOLOGY For Efficient Application Scaling NVIDIA NVLink Bridge - Four NVIDIA Tesla V100 accelerators - Each Tesla V100 GPU in DGX Station has four NVLink connection points, each providing a point- to-point connection to another GPU at a peak bandwidth of 25GB/s - Optimized for: - The bandwidth achievable for a variety of point- to-point and collective communications primitives - The flexibility of the topology - Performance with a subset of the GPUs
  • 18. 19 GPU REQUIREMENT Qgate runs with a single GPU, and scales to multiple GPUs in a single node. - Works with Kepler GPU (Cc3.5) or later. Recommendation is Maxwell GPU (Cc5.0) or later. Multi GPU requirement - NVLink : All-to-all NVLink connections between GPUs are required. For performance, NVLink is strongly recommended. - PCIe: All GPUs should be connected to the same PCIe root complex. CPU - Running with 1 CPU socket is supported. There’s no consideration for NUMA.
  • 19. 20 PERFORMANCE MEASUREMENT Quantum circuit for measurement - 10 Hadamard gates are placed on each qubit. - FP64 is used. Baseline, Single GPU Performance H H H H H H H H H H H H ... ... ... Device CPU (1 core)*1 Single thread on CPU CPU (multi-core) Multi-threaded*2 on CPU (40 threads, 20 physical cores) GPU GPU / CUDA 10 Hadamard gates *1: CPU(1 core) is a model of python-based simulator which is sometimes implemented by using 1 CPU core. *2: Implemented by using C++ STL’s thread class
  • 20. 21 SUMMARY Performance Baseline (30 qubits, Single GPU) # gates applied in sec. Memory bandwidth Acc. CPU (1 core) 0.11 3.7 GB/sec 1 - CPU (multi-core) 1.59 54.8 GB/sec 14.9x 1 GPU 23.5 806 GB/sec 220x 14.7x
  • 22. 23 PROCESSING PIPELINE Built with Python and Native Extensions Gate cancellation Runtime Removing cancelling gates Dynamically grouping qubits, Reducing number of variables required to represent quantum states Reordering operators (gates and measurements) in order to maximize effects of dynamic qubit grouping. Parallelization on computing devices CPU(multi-core), and GPU(CUDA) Python Input (Intermediate repr.) Native extension Output (state vector) Operator reordering Dynamic qubit grouping Quantum computing specific Device specific Reordering qubits to reduce data transfer between devices.Qubit reordering
  • 23. 24 Backend SOFTWARE DIAGRAM qgate.model Quantum circuit object model Built-in gate definitions qgate.simulator.runtimeqgate.simulator Simulator qgate.script Circuit definition on python qgate.openqasm Importing OpenQASM files qgate.simulator.qubits State vector Complex number probability Other plugins … Frontend Plugin Blueqat plugin qgate pyruntime: Python, reference cpuruntime: CPU, multi-core cudaruntime: CUDA, GPU OM (object model) Analyses and optimizations for quantum circuits Runtime Accelerating numerical operations
  • 24. 25 GATE CANCELLATION
    Quantum Circuit Optimization
    Products of some gate pairs cancel out: 𝐼 = 𝑋 ∙ 𝑋 = 𝑌 ∙ 𝑌 = 𝑍 ∙ 𝑍 = 𝐻 ∙ 𝐻
    [Figure: circuit with arbitrary unitary gates U and X gates; adjacent pairs of X gates cancel out]
    Also works for pairs of Y, Z, H gates, whose squares are the identity.
    Ex: Modulo arithmetic* (5^x mod 12, 27 qubits)
    - Before cancellation: 5449 gates
    - After cancellation: 3885 gates
    - Reduction rate: 71.3 % (gates remaining, 3885 / 5449)
    *This circuit was developed by Kato-san in MDR. Ref: V. Vedral, A. Barenco, A. Ekert, https://arxiv.org/abs/quant-ph/9511018v1
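A minimal sketch of this kind of cancellation pass is shown below. It is not Qgate's implementation: the circuit representation (a list of `(name, qubits)` tuples) and the function name `cancel_gates` are assumptions for illustration, and multi-qubit gates are conservatively treated as barriers that are never cancelled.

```python
SELF_INVERSE = {"X", "Y", "Z", "H"}  # gates whose square is the identity

def cancel_gates(circuit):
    """Remove adjacent identical self-inverse single-qubit gates on the same qubit."""
    kept = []     # gates in program order; cancelled entries become None
    stacks = {}   # qubit -> indices into `kept` of surviving gates on that qubit
    for gate in circuit:
        name, qubits = gate
        if name in SELF_INVERSE and len(qubits) == 1:
            stack = stacks.setdefault(qubits[0], [])
            if stack and kept[stack[-1]] == gate:
                kept[stack.pop()] = None   # the pair multiplies to the identity
                continue
        index = len(kept)
        kept.append(gate)
        for q in qubits:
            stacks.setdefault(q, []).append(index)
    return [g for g in kept if g is not None]

circuit = [("X", (0,)), ("H", (0,)), ("H", (0,)), ("X", (0,)), ("CX", (0, 1))]
print(cancel_gates(circuit))  # only the CX gate survives
```

Using a per-qubit stack lets cancellations cascade: once the inner H pair is removed, the outer X pair becomes adjacent and is removed as well.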
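  • 25. 26 DYNAMIC QUBIT GROUPING
    If qubits are not entangled,
    - The state vector can be factorized.
    - The number of variables is reduced.
    3-qubit state vector, size 8:
    $s_0|000\rangle,\ s_1|001\rangle,\ s_2|010\rangle,\ s_3|011\rangle,\ s_4|100\rangle,\ s_5|101\rangle,\ s_6|110\rangle,\ s_7|111\rangle$
    If 1 qubit is not entangled, 1 qubit ⊗ 2 qubits, size 6 = (2 + 4):
    $\left(s_{0,0}|0\rangle,\ s_{0,1}|1\rangle\right) \otimes \left(s_{1,0}|00\rangle,\ s_{1,1}|01\rangle,\ s_{1,2}|10\rangle,\ s_{1,3}|11\rangle\right)$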
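The factorization can be illustrated with a tensor product in plain Python. This is a sketch only; the helper name `kron` and the convention that the first factor holds the higher-order qubit are assumptions, not taken from Qgate.

```python
import math

def kron(a, b):
    """Tensor product of two state vectors; `a` supplies the higher-order index bits."""
    return [x * y for x in a for y in b]

# 1 unentangled qubit in (|0> + |1>)/sqrt(2), and a 2-qubit group in |00>.
one_qubit = [1 / math.sqrt(2), 1 / math.sqrt(2)]
two_qubits = [1.0, 0.0, 0.0, 0.0]

stored = len(one_qubit) + len(two_qubits)   # 6 variables kept in memory
full = kron(one_qubit, two_qubits)          # 8 amplitudes of the full 3-qubit state
print(stored, len(full))
```

The full vector never has to be materialized while the groups stay unentangled; gates are applied to the small factors instead.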
  • 26. 27 DYNAMIC QUBIT GROUPING
    30 qubit case
    If the qubits are divided into 10- and 20-qubit groups:
    - 30-qubit state vector: $s_0|0\ldots00\rangle,\ s_1|0\ldots01\rangle,\ s_2|0\ldots10\rangle,\ \ldots,\ s_{2^{30}-2}|1\ldots10\rangle,\ s_{2^{30}-1}|1\ldots11\rangle$. Size: $2^{30}$.
    - 10 qubits ⊗ 20 qubits: $\left(s_0|0\ldots00\rangle,\ \ldots,\ s_{2^{10}-1}|1\ldots11\rangle\right) \otimes \left(s_0|0\ldots00\rangle,\ \ldots,\ s_{2^{20}-1}|1\ldots11\rangle\right)$. Size: $2^{20} + 2^{10}$ (≈ 0.1 % of $2^{30}$).
  • 27. 28 EX. INVERSE QUANTUM FOURIER TRANSFORM
    Qubits are grouped when a controlled gate is applied.
    [Figure: 5-qubit inverse QFT circuit built from H gates and controlled rotations R1 to R4]
    # Variables as the circuit proceeds: 10 (2x5), 10 (2^2 + 2x3), 12 (2^3 + 2x2), 18 (2^4 + 2), 32 (2^5)
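The variable counts can be reproduced by tracking which qubits have been joined by a multi-qubit gate. The sketch below is illustrative: the function name and the gate list (one controlled gate per newly joined qubit) are assumptions and do not reproduce the exact gate sequence of the circuit on the slide.

```python
def variables_after_each_gate(n_qubits, gates):
    """Return the number of stored amplitudes initially and after each gate.

    `gates` is a list of qubit tuples; a gate on several qubits merges their groups.
    """
    groups = [{q} for q in range(n_qubits)]
    counts = [sum(2 ** len(g) for g in groups)]
    for qubits in gates:
        touched = [g for g in groups if g & set(qubits)]
        if len(touched) > 1:
            merged = set().union(*touched)
            groups = [g for g in groups if not (g & set(qubits))] + [merged]
        counts.append(sum(2 ** len(g) for g in groups))
    return counts

# Five qubits, each controlled gate pulls one more qubit into the entangled group.
print(variables_after_each_gate(5, [(0, 1), (1, 2), (2, 3), (3, 4)]))
```

The same bookkeeping gives 2^10 + 2^20 stored amplitudes for a 30-qubit circuit split into 10- and 20-qubit groups.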
  • 28. 29 EFFECTS OF DYNAMIC QUBIT GROUPING
    iQFT, Numerical Estimation
    The calculation amount is reduced by applying qubit grouping.
    [Figure: calculation amount (log axis) with and without qubit grouping, and the reduction ratio (calculation amount with qubit grouping enabled / disabled), versus # qubits (0 to 32). The reduction ratio is 12.1 % at 30 qubits.]
  • 29. 30 CALCULATION AMOUNT COMPARISON
    CUDA/CPU/Theoretical
    In the range where # qubits is small,
    - Processing overheads are observed.
    In the range where # qubits is large,
    - Computation time is long enough that the overhead is relatively small.
    - Estimation and measurement matched.
    Observed overhead
    - Time for analyzing the quantum circuit.
    - Managing grouped state vectors.
    [Figure: reduction ratio (0 to 1.4) versus # qubits (8 to 32) for CUDA, CPU (multi-core), and theoretical. Processing overheads are observed at small qubit counts; performance improves as expected at large qubit counts.]
  • 30. 31 OPERATOR REORDERING
    Maximizing effects of dynamic qubit grouping
    - Reordering operators so that they are applied to a smaller qubit group.
    - Reducing the amount of calculation.
    [Figure: operators U0 to U4 before and after reordering]
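One way to picture operator reordering is a greedy scheduler that, among the operators whose predecessors have all been applied, picks the one acting on the smallest resulting qubit group. The sketch below is a hypothetical heuristic, not Qgate's algorithm. It assumes operators on disjoint qubits commute and that operators sharing a qubit must keep their original order; `reorder_operators` and `cost` are illustrative names.

```python
def _merged_group(group, qubits):
    """Union of the groups that the given qubits currently belong to."""
    return frozenset().union(*(group[q] for q in qubits))

def reorder_operators(ops, n_qubits):
    """ops: list of qubit tuples in program order. Returns operator indices in the new order."""
    queues = [[i for i, qs in enumerate(ops) if q in qs] for q in range(n_qubits)]
    group = {q: frozenset([q]) for q in range(n_qubits)}
    remaining = set(range(len(ops)))
    order = []
    while remaining:
        # An operator is ready when it is the next pending operator on all of its qubits.
        ready = [i for i in remaining if all(queues[q][0] == i for q in ops[i])]
        best = min(ready, key=lambda i: (len(_merged_group(group, ops[i])), i))
        merged = _merged_group(group, ops[best])
        for q in merged:
            group[q] = merged
        for q in ops[best]:
            queues[q].pop(0)
        remaining.remove(best)
        order.append(best)
    return order

def cost(ops, order, n_qubits):
    """Sum of 2^(group size) over operators, evaluated at the time each one is applied."""
    group = {q: frozenset([q]) for q in range(n_qubits)}
    total = 0
    for i in order:
        merged = _merged_group(group, ops[i])
        for q in merged:
            group[q] = merged
        total += 2 ** len(merged)
    return total

ops = [(0, 1), (2, 3), (1, 2), (0,), (3,)]
new_order = reorder_operators(ops, 4)
print(new_order, cost(ops, range(len(ops)), 4), cost(ops, new_order, 4))
```

In this example the single-qubit operators on q0 and q3 move ahead of the gate that merges all four qubits, so they run on 2-qubit groups (4 amplitudes) instead of the 4-qubit group (16 amplitudes).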
  • 31. 32 BENCHMARK
    Phase Estimation
    One of the most important algorithms of quantum computing
    - Shor’s algorithm: used for the order-finding problem (https://en.wikipedia.org/wiki/Shor%27s_algorithm)
    - Quantum chemistry: used for obtaining matrix eigenvalues
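  • 32. 33 PHASE ESTIMATION
    Without Operator Reordering
    [Figure: phase estimation circuit with controlled gates U, U^2, U^4, U^8, U^16 and an inverse QFT built from H gates and controlled rotations R1 to R4, in the original operator order]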
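  • 33. 34 PHASE ESTIMATION
    Operators are Reordered
    [Figure: the same phase estimation circuit (controlled gates U, U^2, U^4, U^8, U^16 and an inverse QFT built from H gates and controlled rotations R1 to R4) after operator reordering]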
  • 34. 35 PHASE ESTIMATION
    Benchmark
    30 qubit circuit, 493 gates, FP64
    - Measuring global phase of one qubit.
    - 29 qubits are used for measurements.
    - Running on a single Tesla V100 (32 GB).
    [Figure: controlled gates exp(i 2^(n-1) q), exp(i 2^(n-2) q), …, exp(i q) acting from 29 qubits, followed by iQFT]
  • 35. 36 AN EXAMPLE OF CALCULATION RESULTS
    1024 shots of sampling. The initial value is 0.1.
    Raw sampling results, as (value, count):
    (0.09999997168779373, 1)
    (0.09999998286366463, 1)
    (0.09999998472630978, 1)
    (0.09999999031424522, 1)
    (0.09999999217689037, 1)
    (0.09999999403953552, 4)
    (0.09999999590218067, 4)
    (0.09999999776482582, 26)
    (0.09999999962747097, 900)
    (0.10000000149011612, 57)
    (0.10000000335276127, 17)
    (0.10000000521540642, 7)
    (0.10000000707805157, 1)
    (0.10000000894069672, 1)
    (0.10000001080334187, 1)
    (0.10000001639127731, 1)
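The sampled values are consistent with reading the 29 measured bits as a binary fraction k / 2^29. The sketch below makes that assumption explicit, and also assumes the textbook phase-estimation setting in which the estimated eigenvalue is exp(2πi·phase) with phase = 0.1; neither is stated on the slides, and the function names are illustrative.

```python
import math

def phase_from_measurement(k, n_bits=29):
    """Interpret the measured integer k as the binary fraction k / 2^n_bits."""
    return k / (1 << n_bits)

def peak_probability(phase, n_bits=29):
    """Nearest representable outcome k and its ideal probability in phase estimation."""
    n = 1 << n_bits
    k = round(phase * n)
    delta = phase * n - k   # distance from the nearest grid point, in units of 1/2^n_bits
    if delta == 0:
        return k, 1.0
    p = (math.sin(math.pi * delta) / (n * math.sin(math.pi * delta / n))) ** 2
    return k, p

k, p = peak_probability(0.1)
print(k, phase_from_measurement(k), round(p, 3), round(p * 1024))
```

Under these assumptions the most likely outcome is k = 53687091, that is 0.0999999996..., with an ideal probability of about 0.875, close to the 900 of 1024 shots observed.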
  • 36. 37 PHASE ESTIMATION
    Operator Reordering, Single GPU
    30 qubit circuit, 493 gates, FP64
    - Measuring global phase of one qubit.
    - 29 qubits are used for measurements.
    Runtime / optimization     Elapsed time [s]   Acceleration
    CPU / no optimization      213                1
    CPU / optimized            24.7               8.6x
    CUDA / no optimization     13.7               15.5x
    CUDA / optimized           1.86               114x
    [Figure: controlled gates exp(i 2^(n-1) q), exp(i 2^(n-2) q), …, exp(i q) acting from 29 qubits, followed by iQFT]
  • 37. 38 MULTI GPU + NVLINK
  • 38. 39 IDEAL MULTI GPU PERFORMANCE
    Performance Baseline (30 qubits, Single GPU)
    Device               # gates applied per sec.   Memory bandwidth   Acc. (vs. CPU 1 core)   Acc. (vs. CPU multi-core)
    CPU (1 core)         0.11                       3.7 GB/sec         1                       -
    CPU (multi-core)     1.59                       54.8 GB/sec        14.9x                   1
    1 GPU                23.5                       806 GB/sec         220x                    14.7x
    Ideal acceleration with 4 GPUs (DGX Station): 58.8 = 14.7 x 4
  • 39. 40 BOTTLENECK : DATA TRANSFER
    Ex. DGX Station
    NVLink is fast, but slower than GPU memory.
    Bandwidth
    - GPU: 900 GB/s
    - NVLink (1 link, bidirectional): 50 GB/s
    [Figure: DGX Station with 4 GPUs, each with 900 GB/s memory bandwidth, connected by NVLink connections labelled 50 GB/s or 100 GB/s]
  • 40. 41 QUBIT REORDERING
    Multi GPU, Reducing Data Transfers
    Applying gates to q0 ~ q3 is done within each GPU. When q4 or q5 is among the target qubits, data transfers between GPUs happen.
    [Figure: Ex) 6-qubit circuit, q0 to q5. Gates on q0 ~ q3 are applied within each GPU; data transfers between GPUs happen for each gate application on q4 or q5.]
    Ref: 0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit, Thomas Häner, Damian S. Steiger, https://arxiv.org/abs/1704.01127
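Why only the upper qubits cause transfers can be seen from the index layout. The sketch below assumes the state vector is split into equal contiguous chunks, one per GPU, so that the top log2(#GPUs) index bits select the device; this layout and the function names are assumptions for illustration.

```python
def device_of(index, n_qubits, n_gpus):
    """GPU that holds amplitude `index` when the vector is split into contiguous chunks."""
    n_local = n_qubits - (n_gpus.bit_length() - 1)   # n_gpus must be a power of two
    return index >> n_local

def needs_inter_gpu_transfer(target, n_qubits, n_gpus):
    """A gate on `target` pairs index i with i ^ (1 << target); check if the pair spans GPUs."""
    partner_of_zero = 1 << target
    return device_of(0, n_qubits, n_gpus) != device_of(partner_of_zero, n_qubits, n_gpus)

# 6 qubits on 4 GPUs: q0..q3 stay local, q4 and q5 cross GPU boundaries.
print([needs_inter_gpu_transfer(t, 6, 4) for t in range(6)])
```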
  • 41. 42 QUBIT REORDERING
    Multi GPU, Reducing Data Transfers
    Reordering qubits
    - Swapping q0 ~ q2 and q3 ~ q5.
    - All required inter-device communication is done while reordering qubits.
    - All gates are applied within each GPU.
    [Figure: Ex) qubit order q0 q1 q2 q3 q4 q5 becomes q3 q4 q5 q0 q1 q2. Gates are applied within each GPU before and after the reordering; data transfers between GPUs happen only during the reordering.]
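Reordering qubits amounts to permuting the bits of every state-vector index. The following single-process sketch shows the permutation itself; it does not model the actual inter-GPU copies, and the name `reorder_qubits` and the `new_order` convention are assumptions.

```python
def reorder_qubits(state, new_order):
    """Permute qubits: bit position j of the new index takes old qubit new_order[j]."""
    out = [0j] * len(state)
    for old_index, amplitude in enumerate(state):
        new_index = 0
        for j, old_qubit in enumerate(new_order):
            new_index |= ((old_index >> old_qubit) & 1) << j
        out[new_index] = amplitude
    return out

# Swap q0 ~ q2 with q3 ~ q5 in a 6-qubit state vector.
swap_halves = [3, 4, 5, 0, 1, 2]
state = [0j] * 64
state[0b000101] = 1 + 0j          # old q0 = 1 and q2 = 1
moved = reorder_qubits(state, swap_halves)
print(moved.index(1 + 0j))        # index with bits 3 and 5 set
```

After the permutation, gates that originally targeted the upper qubits act on low index bits, so they can be applied within each device's chunk.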
  • 42. 43 BENCHMARK
    Quantum Volume (n=32, d=32), FP64, DGX Station (4 GPUs)
    https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
    32 qubit circuit, 5120 gates, FP64
    Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4
    Runtime               Optimization                              Elapsed time   Acc.
    CPU                   No optimization                           3.1 hours      -
    CUDA, 4 Tesla V100    No optimization                           370 sec        29.7 x
                          + Qubit reordering*                       318 sec        56.7 x
                          + Qubit grouping + Operator reordering    176 sec        62.5 x
    *: Qubits are reordered 10 times during execution of the whole circuit.
  • 43. 44 BENCHMARK
    Phase estimation, 32 qubit circuit
    32 qubit circuit, 558 gates, FP64
    Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4
    Runtime               Optimization                              Elapsed time   Acc.
    CPU                   No optimization                           774 sec        -
    CUDA, 4 Tesla V100    No optimization                           18.4 sec       42 x
                          + Qubit reordering*                       15.4 sec       50 x
                          + Qubit grouping + Operator reordering    3.2 sec        242 x
    *: Qubits are reordered 8 times during execution of the whole circuit.
  • 44. 45 PLANS FOR THE NEXT VERSION
    • Supporting hyper-cube-mesh topology.
      • Fully utilizing 8 GPUs on servers such as DGX-1 and the AWS p3dn.24xlarge instance.
      • Enabling 33-qubit circuits (float64) to run.
    • Acceleration for GPU kernels.
      • Qgate 0.3 implements naïve GPU kernels to apply gates; they are not optimized yet.
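The qubit limits follow from the size of the full state vector. The helper below is a plain arithmetic sketch with an illustrative name; it counts only the amplitudes themselves and ignores any working buffers a simulator may need.

```python
def state_vector_gib(n_qubits, bytes_per_amplitude=16):
    """Size in GiB of a full state vector; 16 bytes per amplitude for complex float64."""
    return (1 << n_qubits) * bytes_per_amplitude / 2 ** 30

for n in (30, 32, 33):
    print(n, "qubits:", state_vector_gib(n), "GiB")
```

Each additional qubit doubles the memory, so a 33-qubit float64 state vector needs 128 GiB, twice the 64 GiB of the 32-qubit benchmarks.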