Min-Yih “Min” Hsu, David Gens, Michael Franz. University of California, Irvine
MCA Daemon
Hybrid Throughput Analysis Beyond Basic Blocks
Keynote, EuroLLVM 2022
Outline
• Motivation
• MCA Daemon (MCAD)
• Future Plans
• Epilogue
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
• UCI was studying the timing impacts of binary patches
• Example: After the firmware on a truck is binary-patched to prevent brakes from locking up, we need to make sure latencies do not degrade terribly
Timing impacts of binary patches
Problem definition
[Diagram: Original Program + Small Binary Patch → Patched Program. The original and the patched program are run on the same set of inputs; what is the difference in execution time, ΔT?]
Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive applications
  • Examples: firmware in cars or satellites (e.g. Kepler space telescope by NASA)
• Performance analysis
  • Insights into performance bottlenecks
  • Examples: Potential CPU pipeline stalling, GPU memory bank conflicts
Execution time assessment
Previous efforts
• Static approaches
  • Throughput analysis: predicting the cycle counts for linear code (e.g. basic block, loop) statically
  • Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal
• Dynamic approaches
  • Cycle-accurate simulators / emulators
  • Examples: gem5, gpgpu-sim
Execution time assessment
Challenges
Static approaches
• Precision: Low
  • Poor handling of branches & function calls
  • Small scope (only a few blocks)
  • Lack of run-time information
• Turnaround: Fast
  • Faster analysis speed (due to coarser granularity)
  • Easier integration with other tools
Dynamic approaches
• Precision: High
  • Complete execution traces
  • Higher fidelity on hardware details
• Turnaround: Slow
  • Usually require non-trivial setup
  • Slow simulation speed
?: The slide leaves a question mark where an approach with both high precision and fast turnaround would sit.
MCA Daemon (MCAD)
High-level concept
[Diagram: In an online environment, a dynamic runtime executing the target program and a static throughput analysis tool run as two separate processes (Process 1 and Process 2). The execution trace is streamed from the runtime to the analysis tool.]
Execution trace:
• The instructions that just got executed
• Run-time values (e.g. register values)
MCA Daemon (MCAD)
High-level concept
[Diagram: The same setup with concrete components. QEMU runs the target program, and the execution trace is streamed to the LLVM MCA libraries in a second process.]
Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and potential performance hazards in a sequence of assembly code
• Using instruction scheduling data (e.g. instruction latency) provided by each LLVM target
  • New ISA (with proper scheduling info) can be supported out of the box
• Accounting for modern processor features: superscalar, out-of-order etc.
• Implemented via lightweight simulation
  • Abstract real CPU pipeline stages into a small handful of stages
Introduction to MCA
An example
Input (test/tools/llvm-mca/X86/BtVer2/dot-product.s):
    vmulps %xmm0, %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    vhaddps %xmm3, %xmm3, %xmm4
Command:
    llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-product.s
Summary:
    Iterations: 300
    Instructions: 900
    Total Cycles: 610
    Total uOps: 900
    Dispatch Width: 2
    uOps Per Cycle: 1.48
    IPC: 1.48
    Block RThroughput: 2.0
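The derived figures in the summary follow directly from the raw counts. As a quick check, here is a small Python calculation that uses only the numbers printed above; the variable names are ours and are not part of llvm-mca.

```python
# Raw counts copied from the llvm-mca summary above.
iterations = 300
instructions = 900
total_cycles = 610
total_uops = 900

# Derived metrics: instructions (or uOps) retired per simulated cycle.
ipc = instructions / total_cycles
uops_per_cycle = total_uops / total_cycles

# Average cost of one pass over the three-instruction block.
cycles_per_iteration = total_cycles / iterations

print(f"IPC: {ipc:.2f}")
print(f"uOps Per Cycle: {uops_per_cycle:.2f}")
print(f"Cycles per iteration: {cycles_per_iteration:.2f}")
```

The per-iteration cost comes out at about 2.03 cycles, close to the reported Block RThroughput of 2.0.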
llvm-mca vs. MCAD
[Diagram: llvm-mca feeds an assembly file into the LLVM MCA libraries. MCAD instead takes the execution trace of a target program running under QEMU and streams it to the LLVM MCA libraries, with QEMU and the MCA side in separate processes.]
MCA Daemon (MCAD)
Highlights
• Combine the advantages of dynamic & static throughput analysis
• Augment the analysis region beyond basic blocks
  • MCAD is able to analyze the entire program execution trace
• Throughput analysis is happening in parallel / on-the-fly with the target program execution
Implementation
Analyze execution traces using MCA
Using unmodified MCA libraries
[Diagram: QEMU runs the target program; the executed instructions pass through a disassembler into the LLVM MCA libraries.]
Analyze execution traces using MCA
Challenge: Sequential workflow
[Diagram: Same setup. The analysis side is blocked until QEMU is finished.]
MCA internal
[Diagram: Assembly file → MCInst → mca::Instruction → mca::SourceMgr → MCA simulation pipeline (Stage 0, Stage 1, …, Stage N) → Display.]
MCA with execution trace stream as input
[Diagram: Execution trace stream → MCInst → mca::Instruction → mca::SourceMgr → MCA simulation pipeline (Stage 0, Stage 1, …, Stage N) → Display. The slide marks this flow as “Blocking”.]
Incremental SourceMgr
[Diagram: Execution trace stream → MCInst → mca::Instruction → mca::IncrementalSourceMgr → MCA simulation pipeline (Stage 0, Stage 1, …, Stage N) → Display.]
• Making mca::Instruction available to simulation pipeline right away
Incremental SourceMgr
[Diagram: QEMU produces the execution trace stream in one process; mca::IncrementalSourceMgr and the MCA simulation pipeline consume it in another.]
Incremental SourceMgr
Implement with threads
[Diagram: QEMU runs in its own process. On the MCA side, a receiver thread takes in the execution trace stream, and a simulation thread runs the mca::Instruction fetching loop that feeds the MCA simulation pipeline through mca::IncrementalSourceMgr.]
Implement with threads: Pros & Cons
25
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
Incremental SourceMgr
Implement with threads: Pros & Cons
25
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
Pros: No modi
fi
cation on the simulation pipeline
Incremental SourceMgr
Implement with threads: Pros & Cons
25
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
Pros: No modi
fi
cation on the simulation pipeline
Cons: To use IncrementalSourceMgr, you have to use threads
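The thread-based arrangement is a producer/consumer pair. The following Python sketch is an illustration only, not MCAD's C++ code: the function names, the sentinel, and the list standing in for the simulation pipeline are all assumptions made for this example.

```python
import queue
import threading

END_OF_STREAM = None  # hypothetical sentinel marking the end of the trace


def receiver(trace_source, pending):
    """Receiver thread: forward each instruction as soon as it arrives."""
    for inst in trace_source:
        pending.put(inst)
    pending.put(END_OF_STREAM)


def simulate(pending, retired):
    """Simulation thread: the fetching loop blocks until an instruction is ready."""
    while True:
        inst = pending.get()
        if inst is END_OF_STREAM:
            break
        retired.append(inst)  # stand-in for running inst through the pipeline stages


def run(trace_source):
    pending = queue.Queue()
    retired = []
    threads = [
        threading.Thread(target=receiver, args=(trace_source, pending)),
        threading.Thread(target=simulate, args=(pending, retired)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return retired


trace = ["vmulps", "vhaddps", "vhaddps"]
result = run(iter(trace))
print(result)
```

The simulation code itself stays untouched, which matches the "Pros" bullet, but the blocking `get()` only works because a second thread keeps the queue filled, which is the "Cons" bullet.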
Incremental SourceMgr
Better solution: Resumable simulation pipeline
[Diagram, animated over three slides: QEMU → execution trace stream → MCInst → mca::Instruction → mca::IncrementalSourceMgr → resumable simulation pipeline (Stage 0, Stage 1, …, Stage N). A subset of the trace arrives and becomes a subset of mca::Instruction; the pipeline processes that subset and then pauses.]
Resumable simulation pipeline
• Save (and restore) the analysis state from previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + resumable pipeline
• Much easier to integrate into other uses
• You can still wrap resumable pipeline with a thread
• Minor downside: Modifications on the simulation pipeline
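The save-and-restore idea can be shown with a toy model. This Python sketch is not the MCA pipeline: the class name, the fixed per-instruction latency, and the two counters are assumptions chosen to keep the example small. What it does show is that state carried across calls lets the caller feed the trace in subsets from a single thread.

```python
class ResumablePipeline:
    """Toy pipeline whose analysis state survives between calls to run()."""

    def __init__(self, latency=2):
        self.latency = latency  # hypothetical fixed cost per instruction
        self.cycles = 0         # state kept across subsets
        self.retired = 0

    def run(self, instructions):
        """Simulate one subset of the trace, then pause and report running totals."""
        for _ in instructions:
            self.cycles += self.latency
            self.retired += 1
        return self.cycles, self.retired


pipeline = ResumablePipeline()
pipeline.run(["vmulps", "vhaddps"])   # first subset of the trace
totals = pipeline.run(["vhaddps"])    # resume with the next subset
print(totals)  # (6, 3)

# Feeding everything at once gives the same totals.
one_shot = ResumablePipeline().run(["vmulps", "vhaddps", "vhaddps"])
print(one_shot == totals)  # True
```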
Incremental SourceMgr + Resumable pipeline
Put into action
[Diagram: QEMU → execution trace stream → MCInst → mca::Instruction → mca::IncrementalSourceMgr → resumable simulation pipeline (Stage 0, Stage 1, …, Stage N).]
• A program that continuously reads from an I/O device
• Only record the user space traces
• Collected ~1 million instructions
Challenge
• Created a significant memory footprint
• Bottleneck: ~37GB of accumulated (virtual) memory was allocated by mca::InstrBuilder::createInstruction
[Chart: memory profile (unit: MB).]
[Diagram: MCInst → mca::InstrBuilder → mca::Instruction.]
Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated until the simulation is finished
• mca::Instruction is also used for tracking simulation state, so it’s hard to make it immutable
• Doesn’t scale really well with large input (recall: ~1 million instructions)
Large memory footprint
Observation
[Diagram: mca::Instruction objects move in the stream direction from mca::IncrementalSourceMgr into the resumable simulation pipeline; the hand-over is labeled “Copy”.]
Large memory footprint
Solution: Recycling mca::Instruction
[Diagram: As in the previous slide, with an added “Recycle” path that hands mca::Instruction objects back to mca::InstrBuilder.]
• 67% improvement on accumulated memory consumption
• ~70% of the mca::Instruction objects are recycled
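Recycling is essentially an object pool: an instruction that has left the pipeline is handed back to the builder and reused for the next incoming one. The Python sketch below illustrates that pattern under our own assumptions; the class and method names are hypothetical and do not mirror the mca::InstrBuilder API.

```python
class Instruction:
    """Mutable record holding both the decoded instruction and simulation state."""

    def __init__(self):
        self.opcode = None
        self.cycles_left = 0

    def reset(self, opcode, latency):
        self.opcode = opcode
        self.cycles_left = latency


class RecyclingBuilder:
    def __init__(self):
        self.free = []      # retired objects waiting to be reused
        self.allocated = 0  # number of fresh allocations
        self.recycled = 0   # number of reuses

    def create(self, opcode, latency=1):
        if self.free:
            inst = self.free.pop()
            self.recycled += 1
        else:
            inst = Instruction()
            self.allocated += 1
        inst.reset(opcode, latency)
        return inst

    def recycle(self, inst):
        self.free.append(inst)


builder = RecyclingBuilder()
for opcode in ["vmulps", "vhaddps", "vhaddps"] * 100:
    inst = builder.create(opcode)
    # ... the instruction would be simulated here ...
    builder.recycle(inst)  # hand it back once it has retired

print(builder.allocated, builder.recycled)  # 1 299
```

In this idealized loop every instruction retires before the next is created, so a single object serves all 300 requests. A real pipeline keeps several instructions in flight, so the share of recycled objects is lower.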
Collecting execution traces via QEMU
[Diagram: On the QEMU side, user-mode QEMU runs the target program, which is instrumented by a custom TCG plugin. The plugin sends data over a TCP socket to the MCAD side, where a receiver and a disassembler (together the broker) produce MCInst.]
Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect executed instructions
  • Example: When a TCG block is translated / executed
• We also added a few plugin interfaces (not upstreamed yet)
  • Example: Retrieving CPU register values
• Sending raw binary instructions* through TCP sockets
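The slides do not describe the wire format used between the TCG plugin and MCAD. As a sketch only, assuming a hypothetical framing of a 1-byte length followed by the raw instruction bytes, a receiver could look like this in Python. A local socket pair stands in for the TCP connection, and the function names and sample byte strings are illustrative.

```python
import socket
import struct


def send_instruction(sock, raw_bytes):
    """Send one instruction as a 1-byte length followed by its raw encoding."""
    sock.sendall(struct.pack("B", len(raw_bytes)) + raw_bytes)


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closed the connection."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            return None
        data += chunk
    return data


def recv_instructions(sock):
    """Yield raw instruction encodings until the sender closes its end."""
    while True:
        header = recv_exact(sock, 1)
        if header is None:
            return
        (length,) = struct.unpack("B", header)
        payload = recv_exact(sock, length)
        if payload is None:
            return
        yield payload


# Assumption: a connected local socket pair replaces the real TCP link.
plugin_side, mcad_side = socket.socketpair()
send_instruction(plugin_side, bytes.fromhex("c5f059d0"))
send_instruction(plugin_side, bytes.fromhex("c5eb7cda"))
plugin_side.close()

received = list(recv_instructions(mcad_side))
mcad_side.close()
print([b.hex() for b in received])
```

Explicit framing matters because a stream socket does not preserve message boundaries; with variable-length encodings such as x86, the receiver needs some way to tell where one instruction ends.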
Complete structure of MCAD
[Diagram: QEMU (target program and TCG plugin) connects over TCP sockets to the (QEMU) Broker (receiver and disassembler), which feeds MCAD Core (IncrementalSourceMgr, resumable simulation pipeline, recycling InstrBuilder), followed by Display. The final build of the slide adds a ☹.]
Complete structure of MCAD
A modular design
[Diagram: The same components, with MCAD Core (IncrementalSourceMgr, resumable simulation pipeline, recycling InstrBuilder) separated from the (QEMU) Broker (receiver and disassembler), which is labeled as a loadable plugin. QEMU, with the target program and TCG plugin, connects to the broker over TCP sockets.]
Complete structure of MCAD
Example: Assembly broker plugin
[Diagram: Assembly file → Assembly Broker (AsmParser), a loadable plugin → MCAD Core (IncrementalSourceMgr, resumable simulation pipeline, recycling InstrBuilder) → Display.]
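The point of the broker abstraction is that the core only depends on a narrow interface for fetching instructions, so the input source can be swapped. The Python sketch below shows that shape; the `Broker` interface, its `fetch` method, and the toy `analyze` loop are hypothetical and are not MCAD's actual plugin API.

```python
from abc import ABC, abstractmethod


class Broker(ABC):
    """Hypothetical broker interface: a source of instructions for the core."""

    @abstractmethod
    def fetch(self, max_count):
        """Return up to max_count instructions; an empty list means end of input."""


class AssemblyBroker(Broker):
    """Reads instructions from assembly text, one per non-empty line."""

    def __init__(self, text):
        self.lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        self.pos = 0

    def fetch(self, max_count):
        batch = self.lines[self.pos:self.pos + max_count]
        self.pos += len(batch)
        return batch


def analyze(broker, batch_size=2):
    """Core loop: knows only the Broker interface, not where instructions come from."""
    count = 0
    while True:
        batch = broker.fetch(batch_size)
        if not batch:
            return count
        count += len(batch)  # stand-in for simulating the batch


source = """
vmulps %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
vhaddps %xmm3, %xmm3, %xmm4
"""
total = analyze(AssemblyBroker(source))
print(total)  # 3
```

A broker backed by a socket receiver could implement the same `fetch` method without any change to `analyze`.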
Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory consumption
• Using execution trace collected from running x86_64 FFmpeg 4.2 to decode a 14KB MPEG-4 video file
  • Command: ffmpeg -i input.mp4 -f null -
  • Size of the trace: ~27 million x86_64 instructions
• For baseline, we dump the execution trace (assembly instructions) to a file before feeding into llvm-mca
  • Time measurement on baseline only accounts for llvm-mca’s run time, excluding the trace collection time
Evaluation
Scalability compared against llvm-mca
[Bar charts comparing llvm-mca and MCAD: analysis time (seconds) and max resident memory (gigabytes).]
• Analysis time: MCAD is 4x faster
• Max resident memory: MCAD uses 13x less
Evaluation
Scalability compared against other static throughput analysis tools
Tool      Analysis Time              Max Resident Memory
uiCA      Timeout after 48h          113 GB
OSACA     Exit w/ error after 24h    N/A
Ithemal   Exit w/ error after 2m     N/A
MCAD      52.69s                     2.16 GB
// TODO
• More efficient ways to collect traces without QEMU
• Analyzing traces from multi-threaded programs
• Improve MCA’s precision via dynamic information (e.g. memory accesses)
• Visualizing analysis results. Or: Improve MCA’s result display
  • Example: Loadable plugins for custom display of the result
• Going upstream: QEMU & LLVM
Going upstream: LLVM
The plan
• We would like to upstream components that are beneficial to the core MCA libraries first
[Diagram (“The plan: components to upstream”): the MCAD structure with each component marked as “Intended to upstream”, “Not intended to upstream for now”, or “Unrelated / no change”. Components shown: QEMU, TCG plugin, target program, TCP sockets, (QEMU) Broker (disassembler, receiver), MCAD Core (IncrementalSourceMgr, resumable simulation pipeline, recycling InstrBuilder), Display*.]
• QEMU broker plugin & our TCG plugin will be maintained out-of-tree
• We’re not sure about upstreaming rest of the tool right now
• With the assembly broker, MCAD can be a drop-in replacement for llvm-mca…with even more features (e.g. the broker plugin infrastructure)
• Some of the (advanced) interfaces in broker plugin are only used by QEMU broker. So, without the latter, it’s not well tested.
Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries
• Online, whole-program analysis on real-world applications
• Scales to large programs with tens of millions of instructions
• We improved the performance & flexibility of core MCA libraries
• We would like to merge these changes upstream to benefit the wider community
Source code
https://github.com/securesystemslab/LLVM-MCA-Daemon
Acknowledgements
This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under contract N66001-20-C-4027
Special Thanks:
Galois Inc. (https://galois.com), Immunant Inc. (https://immunant.com)
Thank You!
Q&A
[Photo: Aldrich Park @ UC Irvine, California, with the speaker labeled “Me”.]
Appendix
Introduction to MCA
An example
Input (test/tools/llvm-mca/X86/BtVer2/dot-product.s):
    vmulps %xmm0, %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    vhaddps %xmm3, %xmm3, %xmm4
Command:
    llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-product.s
Timeline:
                        012345
    Index     0123456789

    [0,0]     DeeER.    .    .   vmulps
    [0,1]     D==eeeER  .    .   vhaddps
    [0,2]     .D====eeeER    .   vhaddps
    [1,0]     .DeeE-----R    .   vmulps
    [1,1]     . D=eeeE---R   .   vhaddps
    [1,2]     . D====eeeER   .   vhaddps
    [2,0]     .  DeeE-----R  .   vmulps
    [2,1]     .  D====eeeER  .   vhaddps
    [2,2]     .   D======eeeER   vhaddps

MCA Daemon: Hybrid Throughput Analysis Beyond Basic Blocks

Min-Yih “Min” Hsu, David Gens, Michael Franz. University of California, Irvine
MCA Daemon
Hybrid Throughput Analysis Beyond Basic Blocks
Keynote, EuroLLVM 2022

Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
• UCI was studying the timing impacts of binary patches
• Example: After the firmware on a truck is binary-patched to prevent brakes from locking up, we need to make sure latencies do not degrade terribly

Timing impacts of binary patches
Problem definition
[Diagram: a small binary patch turns the original program into the patched program. Both programs run on the same set of inputs; the question is the difference in execution time, ΔT.]

Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive applications
• Examples: firmware in cars or satellites (e.g. Kepler space telescope by NASA)
• Performance analysis
• Insights into performance bottlenecks
• Examples: Potential CPU pipeline stalling, GPU memory bank conflicts

Execution time assessment
Previous efforts
• Static approaches
• Throughput analysis: predicting the cycle counts for linear code (e.g. basic block, loop) statically
• Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal
• Dynamic approaches
• Cycle-accurate simulators / emulators
• Examples: gem5, gpgpu-sim

Execution time assessment
Challenges
• Static, low precision: poor handling of branches & function calls; small scope (only a few blocks); lack of run-time information
• Static, fast turnaround: faster analysis speed (due to coarser granularity); easier integration with other tools
• Dynamic, high precision: complete execution traces; higher fidelity on hardware details
• Dynamic, slow turnaround: usually requires non-trivial setup; slow simulation speed
• Open question (marked “?” on the slide): an approach with both high precision and fast turnaround

MCA Daemon (MCAD)
High-level concept
[Diagram: in an online environment, a dynamic runtime executing the target program streams its execution trace to a static throughput analysis tool; the two run as separate processes.]
• The execution trace contains the instructions that just got executed
• Run-time values (e.g. register values)

MCA Daemon (MCAD)
High-level concept
[Diagram: the layout with concrete components. QEMU runs the target program in one process and streams the execution trace to the LLVM MCA libraries in another.]

Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and potential performance hazards in a sequence of assembly code
• Using instruction scheduling data (e.g. instruction latency) provided by each LLVM target
• New ISA (with proper scheduling info) can be supported out of the box
• Accounting for modern processor features: superscalar, out-of-order, etc.
• Implemented via lightweight simulation
• Abstract real CPU pipeline stages into a small handful of stages

Introduction to MCA
An example
Input (test/tools/llvm-mca/X86/BtVer2/dot-product.s):
    vmulps %xmm0, %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    vhaddps %xmm3, %xmm3, %xmm4
Command:
    llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-product.s
Summary:
    Iterations:        300
    Instructions:      900
    Total Cycles:      610
    Total uOps:        900
    Dispatch Width:    2
    uOps Per Cycle:    1.48
    IPC:               1.48
    Block RThroughput: 2.0

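The summary figures are related by simple arithmetic. The short sketch below recomputes the derived values from the raw counts on the slide; the variable names are illustrative only, and the upper bound on IPC assumes the usual reading of Block RThroughput as steady-state cycles per iteration.

```python
# Raw counts copied from the llvm-mca summary for dot-product.s on btver2.
iterations = 300
instructions = 900
total_cycles = 610
total_uops = 900
block_rthroughput = 2.0  # reciprocal throughput: cycles per iteration

instructions_per_iteration = instructions // iterations

# Observed rates over the whole simulated run.
ipc = instructions / total_cycles
uops_per_cycle = total_uops / total_cycles
cycles_per_iteration = total_cycles / iterations

# Upper bound on IPC when no loop-carried dependency limits the block.
max_ipc = instructions_per_iteration / block_rthroughput

print(f"IPC: {ipc:.2f}")
print(f"uOps per cycle: {uops_per_cycle:.2f}")
print(f"Cycles per iteration: {cycles_per_iteration:.2f}")
print(f"Max IPC: {max_ipc:.2f}")
```

The observed IPC of about 1.48 sits just under the bound of 1.50, which suggests that this block is limited by resource throughput rather than by dependencies between iterations.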
[Diagram: llvm-mca vs. MCAD. llvm-mca feeds an assembly file to the LLVM MCA libraries. MCAD feeds the LLVM MCA libraries with an execution trace streamed from QEMU, which runs the target program in a separate process.]

MCA Daemon (MCAD)
Highlights
• Combine the advantages of dynamic & static throughput analysis
• Augment the analysis region beyond basic blocks
• MCAD is able to analyze the entire program execution trace
• Throughput analysis is happening in parallel / on-the-fly with the target program execution

Analyze execution traces using MCA
Using unmodified MCA libraries
[Diagram: QEMU runs the target program; the executed instructions pass through a disassembler into the LLVM MCA libraries.]

Analyze execution traces using MCA
Challenge: Sequential workflow
[Diagram: the same flow, with the analysis side blocked until QEMU is finished.]

MCA internal
[Diagram: assembly file → MCInst → mca::Instruction (held by mca::SourceMgr) → MCA simulation pipeline (Stage 0, Stage 1, …, Stage N) → display.]

MCA with execution trace stream as input
[Diagram: the MCA pipeline with an execution trace stream, instead of an assembly file, as the input to mca::SourceMgr; this input is marked as blocking.]

Incremental SourceMgr
[Diagram: mca::IncrementalSourceMgr takes the place of mca::SourceMgr between the execution trace stream and the MCA simulation pipeline.]
• Making mca::Instruction available to the simulation pipeline right away

Incremental SourceMgr
[Diagram: QEMU and the MCA side (mca::IncrementalSourceMgr plus simulation pipeline) run as two separate processes connected by the execution trace stream.]

Incremental SourceMgr
Implement with threads
[Diagram: inside the MCA process, a receiver thread takes the execution trace stream from QEMU, and a simulation thread runs the mca::Instruction fetching loop that feeds the simulation pipeline.]

Incremental SourceMgr
Implement with threads: Pros & Cons
• Pros: No modification on the simulation pipeline
• Cons: To use IncrementalSourceMgr, you have to use threads

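The threaded design can be sketched as a plain producer/consumer pair. This is a minimal illustration, not MCAD's code: the queue stands in for the incremental source manager, `receiver` and `simulate` are hypothetical names, and appending to a list stands in for running the pipeline stages.

```python
import queue
import threading

END_OF_STREAM = object()  # sentinel: no more instructions will arrive

def receiver(trace, pending):
    """Receiver thread: forwards each traced instruction as it arrives."""
    for inst in trace:
        pending.put(inst)
    pending.put(END_OF_STREAM)

def simulate(pending, retired):
    """Simulation thread: blocking fetch loop in front of the pipeline."""
    while True:
        inst = pending.get()  # blocks until the receiver supplies more
        if inst is END_OF_STREAM:
            break
        retired.append(inst)  # stand-in for running the pipeline stages

trace = ["vmulps", "vhaddps", "vhaddps"]
pending = queue.Queue()
retired = []

workers = [
    threading.Thread(target=receiver, args=(trace, pending)),
    threading.Thread(target=simulate, args=(pending, retired)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(retired)
```

The blocking `get` is what forces the second thread: without it, the fetch loop would stall whichever thread also has to receive the trace.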
Incremental SourceMgr
Better solution: Resumable simulation pipeline
[Diagram: QEMU → execution trace stream → mca::IncrementalSourceMgr → resumable simulation pipeline (Stage 0, Stage 1, …, Stage N).]
• A subset of the trace arrives from QEMU
• The corresponding subset of mca::Instruction is passed to the pipeline
• Pause: the pipeline pauses between subsets

Resumable simulation pipeline
• Save (and restore) the analysis state from the previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + resumable pipeline
• Much easier to integrate into other uses
• You can still wrap the resumable pipeline with a thread
• Minor downside: Modifications on the simulation pipeline

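A minimal single-threaded sketch of the resumable idea follows. The classes `IncrementalSource` and `ResumablePipeline` are hypothetical stand-ins rather than the LLVM classes, and the one-cycle-per-instruction cost model is a deliberate simplification; the point is only that analysis state survives between calls to `run()`.

```python
from collections import deque

class IncrementalSource:
    """Hypothetical stand-in for an incremental instruction source."""
    def __init__(self):
        self.pending = deque()
        self.end_of_stream = False

    def add(self, insts):
        self.pending.extend(insts)

class ResumablePipeline:
    """Toy pipeline whose analysis state survives across run() calls."""
    def __init__(self, source):
        self.source = source
        self.cycles = 0   # state kept between subsets of instructions
        self.retired = 0

    def run(self):
        """Simulate what is available; return 'paused' or 'done'."""
        while self.source.pending:
            self.source.pending.popleft()
            self.cycles += 1  # toy cost model: one cycle per instruction
            self.retired += 1
        return "done" if self.source.end_of_stream else "paused"

source = IncrementalSource()
pipeline = ResumablePipeline(source)

source.add(["vmulps", "vhaddps"])
first = pipeline.run()    # consumes the first subset, then pauses

source.add(["vhaddps"])
source.end_of_stream = True
second = pipeline.run()   # resumes from the saved state

print(first, second, pipeline.retired, pipeline.cycles)
```

Because `run()` returns instead of blocking, the caller decides when to fetch more trace data, so no second thread is needed.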
Incremental SourceMgr + Resumable pipeline
Put into real actions
• A program that continuously reads from an I/O device
• Only record the user space traces
• Collected ~1 million instructions

Challenge
• Created a significant memory footprint
• Bottleneck: ~37GB of accumulated (virtual) memory was allocated by mca::InstrBuilder::createInstruction
[Chart: memory usage (unit: MB). Diagram: MCInst → mca::InstrBuilder → mca::Instruction.]

Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated until the simulation is finished
• mca::Instruction is also used for tracking simulation state, so it’s hard to make it immutable
• Doesn’t scale well with large input (recall: ~1 million instructions)

Large memory footprint
Solution: Recycling mca::Instruction
[Diagram: mca::Instruction objects flow from mca::IncrementalSourceMgr into the resumable simulation pipeline (labels: copy, stream direction); a recycle path leads back to mca::InstrBuilder.]
• 67% improvement on accumulated memory consumption
• ~70% of the mca::Instruction objects are recycled

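Recycling can be illustrated with a small object pool. This is a sketch under stated assumptions, not the MCA implementation: `RecyclingBuilder`, `create`, and `release` are hypothetical names, and the limit of four instructions in flight is an arbitrary toy parameter.

```python
class Instruction:
    """Mutable record that is reused across the trace."""
    def __init__(self):
        self.opcode = None
        self.cycles_left = 0

class RecyclingBuilder:
    """Hypothetical builder that reuses retired Instruction objects."""
    def __init__(self):
        self.free = []       # retired objects available for reuse
        self.allocated = 0
        self.recycled = 0

    def create(self, opcode, latency):
        if self.free:
            inst = self.free.pop()
            self.recycled += 1
        else:
            inst = Instruction()
            self.allocated += 1
        inst.opcode = opcode         # reset all per-use state
        inst.cycles_left = latency
        return inst

    def release(self, inst):
        """Called once the pipeline has retired the instruction."""
        self.free.append(inst)

builder = RecyclingBuilder()
in_flight = []
for _ in range(1000):
    in_flight.append(builder.create("vhaddps", 3))
    if len(in_flight) > 4:           # toy model: at most 4 in flight
        builder.release(in_flight.pop(0))

print(builder.allocated, builder.recycled)
```

In this toy model the number of live objects is bounded by the number of instructions in flight rather than by the length of the trace.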
Collecting execution traces via QEMU
[Diagram: on the QEMU side, user-mode QEMU runs the target program, instrumented by a custom TCG plugin. The plugin sends data over a TCP socket to the broker on the MCAD side, where a receiver and a disassembler produce MCInst.]

Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect executed instructions
• Example: When a TCG block is translated / executed
• We also added a few plugin interfaces (not upstreamed yet)
• Example: Retrieving CPU register values
• Sending raw binary instructions* through TCP sockets

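Sending raw instruction bytes over a stream socket needs some way to mark instruction boundaries. The sketch below assumes a 1-byte length prefix per instruction; that framing is an assumption made for illustration and is not taken from the slides, and `socket.socketpair()` stands in for a real TCP connection so the example is self-contained.

```python
import socket
import struct

def send_inst(sock, raw):
    """Send one instruction as a 1-byte length prefix plus raw bytes."""
    sock.sendall(struct.pack("B", len(raw)) + raw)

def recv_exact(sock, n):
    """Read exactly n bytes, or return b'' if the peer closed early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return b""
        buf += chunk
    return buf

def recv_insts(sock):
    """Yield raw instruction encodings until the sender closes the stream."""
    while True:
        header = recv_exact(sock, 1)
        if not header:
            return
        (length,) = struct.unpack("B", header)
        yield recv_exact(sock, length)

# A connected socket pair stands in for the plugin-to-broker connection.
plugin_side, broker_side = socket.socketpair()

# Encodings of vmulps %xmm0, %xmm1, %xmm2 and vhaddps %xmm2, %xmm2, %xmm3.
trace = [bytes.fromhex("c5f059d0"), bytes.fromhex("c5eb7cda")]
for raw in trace:
    send_inst(plugin_side, raw)
plugin_side.close()

received = list(recv_insts(broker_side))
broker_side.close()
print([raw.hex() for raw in received])
```

On the receiving side, each byte string would then be handed to a disassembler to obtain an instruction object.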
Complete structure of MCAD
[Diagram: on the QEMU side, the target program is instrumented by the TCG plugin, which connects over TCP sockets to the (QEMU) broker (receiver and disassembler). The broker feeds the MCAD core (IncrementalSourceMgr, resumable simulation pipeline, recycling InstrBuilder), which feeds the display. The slide marks this layout with ☹.]

Complete structure of MCAD
A modular design
[Diagram: the same components, with the (QEMU) broker attached to the MCAD core as a loadable plugin.]

Complete structure of MCAD
Example: Assembly broker plugin
[Diagram: an assembly broker, loaded as a plugin, reads an assembly file through AsmParser and feeds the MCAD core (IncrementalSourceMgr, resumable simulation pipeline, recycling InstrBuilder), which feeds the display.]

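The modular design can be pictured as a small interface that every broker implements, so that the core never needs to know where instructions come from. The sketch below is hypothetical throughout: `Broker`, `fetch`, `AssemblyBroker`, and `run_core` are illustrative names, not MCAD's actual plugin API.

```python
from abc import ABC, abstractmethod

class Broker(ABC):
    """Hypothetical plugin interface: a broker supplies instructions."""
    @abstractmethod
    def fetch(self, max_count):
        """Return up to max_count instructions; [] means end of input."""

class AssemblyBroker(Broker):
    """Reads instructions from assembly text, skipping blanks and comments."""
    def __init__(self, text):
        lines = (line.strip() for line in text.splitlines())
        self.insts = [l for l in lines if l and not l.startswith("#")]
        self.pos = 0

    def fetch(self, max_count):
        batch = self.insts[self.pos:self.pos + max_count]
        self.pos += len(batch)
        return batch

def run_core(broker, batch_size=2):
    """Stand-in for the core: drains any broker without knowing its source."""
    consumed = []
    while batch := broker.fetch(batch_size):
        consumed.extend(batch)
    return consumed

asm = """
# dot product
vmulps %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
vhaddps %xmm3, %xmm3, %xmm4
"""
consumed = run_core(AssemblyBroker(asm))
print(consumed)
```

A QEMU-backed broker would implement the same `fetch` method on top of a socket receiver, which is what lets the core stay unchanged when the trace source changes.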
Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory consumption
• Using execution trace collected from running x86_64 FFmpeg 4.2 to decode a 14KB MPEG-4 video file
• Command: ffmpeg -i input.mp4 -f null -
• Size of the trace: ~27 million x86_64 instructions
• For baseline, we dump the execution trace (assembly instructions) to a file before feeding into llvm-mca
• Time measurement on baseline only accounts for llvm-mca’s run time, excluding the trace collection time

Evaluation
Scalability compared against llvm-mca
[Bar charts: analysis time in seconds and max resident memory in gigabytes, llvm-mca vs. MCAD.]
• Analysis time: MCAD is 4x faster
• Max resident memory: MCAD uses 13x less

Evaluation
Scalability compared against other static throughput analysis tools
Tool      Analysis Time             Max Resident Memory
uiCA      Timeout after 48h         113 GB
OSACA     Exit w/ error after 24h   N/A
Ithemal   Exit w/ error after 2m    N/A
MCAD      52.69s                    2.16 GB

  • 147. // TODO • More efficient ways to collect traces without QEMU • Analyzing traces from multi-threaded programs • Improve MCA’s precision via dynamic information (e.g. memory accesses) • Visualizing analysis results. Or: Improve MCA’s result display • Example: Loadable plugins for custom display of the result • Going upstream: QEMU & LLVM 46
  • 149. Going upstream: LLVM The plan • We would like to upstream components that are beneficial to the core MCA libraries first 47
  • 150. Going upstream: LLVM The plan: components to upstream 48 [Architecture diagram. Components: QEMU, TCG Plugin, MCAD Core, IncrementalSourceMgr, Resumable Simulation Pipeline, (QEMU) Broker, Disassembler, Receiver, Target Program, TCP Sockets, Recycling InstrBuilder, Display*. Legend: Not intended to upstream for now / Intended to upstream / Unrelated / no change]
  • 155. Going upstream: LLVM The plan • We would like to upstream components that are beneficial to the core MCA libraries first • QEMU broker plugin & our TCG plugin will be maintained out-of-tree • We’re not sure about upstreaming the rest of the tool right now • With the assembly broker, MCAD can be a drop-in replacement for llvm-mca, with even more features (e.g. the broker plugin infrastructure) • Some of the (advanced) interfaces in the broker plugin are only used by the QEMU broker, so without the latter they are not well tested. 49
  • 162. Summary • MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries • Online, whole-program analysis on real-world applications • Scales to large programs with tens of millions of instructions • We improved the performance & flexibility of core MCA libraries • We would like to merge these changes upstream to benefit the wider community 51
  • 164. Acknowledgements 53 This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under contract N66001-20-C-4027 Special Thanks: Galois Inc. (https://galois.com) Immunant Inc. (https://immunant.com)
  • 165. Thank You! 54 Q&A [Photo: Aldrich Park @ UC Irvine, California]
  • 168. Introduction to MCA An example 56
    Input (test/tools/llvm-mca/X86/BtVer2/dot-product.s):
      vmulps %xmm0, %xmm1, %xmm2
      vhaddps %xmm2, %xmm2, %xmm3
      vhaddps %xmm3, %xmm3, %xmm4
    Command: llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-product.s
    Timeline:
                          012345
      Index     0123456789
      [0,0]     DeeER.    .    .   vmulps
      [0,1]     D==eeeER  .    .   vhaddps
      [0,2]     .D====eeeER    .   vhaddps
      [1,0]     .DeeE-----R    .   vmulps
      [1,1]     . D=eeeE---R   .   vhaddps
      [1,2]     . D====eeeER   .   vhaddps
      [2,0]     .  DeeE-----R  .   vmulps
      [2,1]     .  D====eeeER  .   vhaddps
      [2,2]     .   D======eeeER   vhaddps
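To make the timeline easier to read, here is a small Python sketch that decodes one timeline row into per-stage cycle counts. It relies on the legend from the llvm-mca documentation (`D` dispatched, `=` waiting in the scheduler, `e` executing, `E` executed, `-` executed but waiting to retire, `R` retired). The function name `decode_row` is hypothetical, and it assumes each row string starts at cycle 0 with its column alignment intact.

```python
from collections import Counter


def decode_row(row):
    """Summarize one llvm-mca timeline row such as 'D==eeeER  .    .'.

    Assumes the first character of row corresponds to cycle 0.
    """
    counts = Counter(row)
    return {
        "dispatch_cycle": row.index("D"),   # cycle the instruction is dispatched
        "retire_cycle": row.index("R"),     # cycle the instruction retires
        "scheduler_wait": counts["="],      # cycles waiting on operands/resources
        "executing": counts["e"],           # cycles spent executing
        "retire_wait": counts["-"],         # cycles finished but not yet retired
    }


if __name__ == "__main__":
    rows = {
        "[0,1] vhaddps": "D==eeeER  .    .",
        "[1,0] vmulps": ".DeeE-----R    .",
        "[2,2] vhaddps": ".   D======eeeER",
    }
    for label, row in rows.items():
        print(label, decode_row(row))
```

Read this way, the rows show the second `vhaddps` of each iteration spending more cycles in the scheduler than the instructions before it, since each instruction consumes the result of the previous one.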