Java Performance:
Speedup your applications with hardware counters
Sergey Kuksenko
sergey.kuksenko@oracle.com, @kuksenk0
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/pmu-hwc-java
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
The following is intended to outline our general product direction. It
is intended for information purposes only, and may not be
incorporated into any contract. It is not a commitment to deliver any
material, code, or functionality, and should not be relied upon in
making purchasing decisions. The development, release, and timing
of any features or functionality described for Oracle’s products
remains at the sole discretion of Oracle.
Copyright © 2016, Oracle and/or its affiliates. All rights reserved. 2/66
Intriguing Introductory Example
Gotcha, here is my hot method
or
On Algorithmic Optimizations
just-O
3/66
Example 1: Very Hot Code
public int[][] multiply(int[][] A, int[][] B) {
int size = A.length;
int[][] R = new int[size][size];
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return R;
}
O(N^3)
big-O
4/66
Example 1: "Ok Google"
O(N^2.81)
5/66
Example 1: Very Hot Code
N multiply strassen
128 0.0026 0.0023
512 0.48 0.12
2048 28 5.8
8192 3571 282
time; seconds/op
6/66
Example 1: Optimized speedup over baseline
7/66
Try the other way
8/66
Example 1: JMH, -prof perfnorm, N=256
                  per multiplication   per iteration
CPI               1.06
cycles            146 × 10⁶            8.7
instructions      137 × 10⁶            8.2
L1-loads          68 × 10⁶             4
L1-load-misses    33 × 10⁶             2
L1-stores         0.56 × 10⁶
L1-store-misses   38 × 10³
LLC-loads         26 × 10⁶             1.6
LLC-stores        11.6 × 10³
∼ 256 KB per matrix
9/66
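The «∼ 256 KB per matrix» note is simple arithmetic: for N = 256, each int[256][256] holds 256 × 256 four-byte ints (array-object headers and row padding ignored). A throwaway sketch of the count (the class and method names are illustrative, not from the talk):

```java
public class MatrixFootprint {
    // Payload size of an N×N int matrix, ignoring array-object headers.
    static long footprintBytes(int n) {
        return (long) n * n * Integer.BYTES;
    }

    public static void main(String[] args) {
        long bytes = footprintBytes(256);
        System.out.println(bytes / 1024 + " KB"); // 256 KB
    }
}
```

At that size two operand matrices plus the result already exceed a typical 256 KB L2, which is why the L1/LLC miss counts above matter.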
Example 1: Very Hot Code
public int[][] multiply(int[][] A, int[][] B) {
int size = A.length;
int[][] R = new int[size][size];
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return R;
}
L1-load-misses
10/66
Example 1: Very Hot Code
public int[][] multiplyIKJ(int[][] A, int[][] B) {
int size = A.length;
int[][] R = new int[size][size];
for (int i = 0; i < size; i++) {
for (int k = 0; k < size; k++) {
int aik = A[i][k];
for (int j = 0; j < size; j++) {
R[i][j] += aik * B[k][j];
}
}
}
return R;
}
11/66
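The two loop orders compute the same product; the win comes purely from memory-access order, since IKJ streams over rows of B and R instead of walking B column-by-column. A minimal, self-contained sketch (plain Java, no JMH; names are illustrative) that checks the two variants agree:

```java
import java.util.Arrays;
import java.util.Random;

public class LoopOrder {
    // Baseline IJK order: B is read column-by-column, so nearly every
    // B[k][j] load touches a different cache line.
    static int[][] multiplyIJK(int[][] A, int[][] B) {
        int size = A.length;
        int[][] R = new int[size][size];
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++) {
                int s = 0;
                for (int k = 0; k < size; k++)
                    s += A[i][k] * B[k][j];
                R[i][j] = s;
            }
        return R;
    }

    // IKJ order: the inner loop walks B[k] and R[i] sequentially,
    // which is cache-friendly and auto-vectorizable.
    static int[][] multiplyIKJ(int[][] A, int[][] B) {
        int size = A.length;
        int[][] R = new int[size][size];
        for (int i = 0; i < size; i++)
            for (int k = 0; k < size; k++) {
                int aik = A[i][k];
                for (int j = 0; j < size; j++)
                    R[i][j] += aik * B[k][j];
            }
        return R;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 64;
        int[][] A = new int[n][n], B = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                A[i][j] = rnd.nextInt(100);
                B[i][j] = rnd.nextInt(100);
            }
        if (!Arrays.deepEquals(multiplyIJK(A, B), multiplyIKJ(A, B)))
            throw new AssertionError("loop orders disagree");
        System.out.println("IJK and IKJ agree");
    }
}
```

For real timing, measure with JMH as in the tables that follow; a naive loop-and-stopwatch comparison is easily distorted by JIT warmup.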
Example 1: Very Hot Code
IJK IKJ
N multiply strassen multiply strassen
128 0.0026 0.0023 0.0005
512 0.48 0.12 0.03 0.02
2048 28 5.8 2 1.5
8192 3571 282 264 79
time; seconds/op
12/66
Example 1: Optimized speedup over baseline
13/66
Example 1: JMH, -prof perfnorm, N=256
                  IJK                          IKJ
                  multiply      iteration      multiply      iteration
CPI               1.06                         0.51
cycles            146 × 10⁶     8.7            9.7 × 10⁶     0.6
instructions      137 × 10⁶     8.2            19 × 10⁶      1.1
L1-loads          68 × 10⁶      4              5.4 × 10⁶     0.3
L1-load-misses    33 × 10⁶      2              1.1 × 10⁶     0.1
L1-stores         0.56 × 10⁶                   2.7 × 10⁶     0.2
L1-store-misses   38 × 10³                     9 × 10³
LLC-loads         26 × 10⁶      1.6            0.3 × 10⁶
LLC-stores        11.6 × 10³                   3.5 × 10³
?
14/66
Example 1: Free beer benefits!
cycles insts
...
0.15% 0.17% <addr>: vmovdqu 0x10(%rax,%rdx,4),%ymm5
8.83% 9.02% vpmulld %ymm4,%ymm5,%ymm5
43.60% 42.03% vpaddd 0x10(%r9,%rdx,4),%ymm5,%ymm5
6.26% 6.24% vmovdqu %ymm5,0x10(%r9,%rdx,4) ;*iastore
19.82% 20.82% add $0x8,%edx ;*iinc
1.46% 2.75% cmp %ecx,%edx
jl <addr> ;*if_icmpge
...
Vectorization (SSE/AVX)!
how to get asm: JMH, -prof perfasm
15/66
Chapter 1
Performance Optimization Methodology:
A Very Short Introduction.
Three magic questions and the direction of the journey
16/66
Three magic questions
What? ⇒ Where? ⇒ How?
• What prevents my application from working faster?
(monitoring)
• Where does it hide?
(profiling)
• How to stop it messing with performance?
(tuning/optimizing)
17/66
Top-Down Approach
• System Level
– Network, Disk, OS, CPU/Memory
• JVM Level
– GC/Heap, JIT, Classloading
• Application Level
– Algorithms, Synchronization, Threading, API
• Microarchitecture Level
– Caches, Data/Code alignment, CPU Pipeline Stalls
18/66
Methodology in Essence
19/66
Methodology in Essence
http://j.mp/PerfMindMap
19/66
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
• Lots of %sys ⇒ . . .
⇓
• Lots of %irq, %soft ⇒ . . .
⇓
• Lots of %iowait ⇒ . . .
⇓
• Lots of %idle ⇒ . . .
⇓
• Lots of %user
20/66
Chapter 2
High CPU Load
and
the main question:
«who/what is to blame?»
21/66
CPU Utilization
• What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can profiling help?
A traditional profiler shows «WHERE» application time is spent,
but gives no answer to the question «WHY».
22/66
Who/What is to blame?
Complex CPU microarchitecture:
• Inefficient algorithm ⇒ 100% CPU
• Pipeline stall due to memory load/stores ⇒ 100% CPU
• Pipeline flush due to mispredicted branch ⇒ 100% CPU
• Expensive instructions ⇒ 100% CPU
• Insufficient ILP ⇒ 100% CPU
• etc. ⇒ 100% CPU
Instruction Level Parallelism
23/66
Chapter 3
Hardware Counters
HWC, PMU - WTF?
24/66
PMU: Performance Monitoring Unit
The Performance Monitoring Unit profiles hardware activity and is built into the CPU.
PMU Internals (in less than 21 seconds) :
Hardware counters (HWC) count hardware performance events
(performance monitoring events)
25/66
Events
Vendor’s documentation! (e.g. 2 pages from 32)
26/66
Events: Issues
• Hundreds of events
• Microarchitectural experience is required
• Platform dependent
– vary from CPU vendor to vendor
– may vary when CPU manufacturer introduces new microarchitecture
• How to work with HWC?
27/66
HWC
HWC modes:
• Counting mode
– if (event_happened) ++counter;
– general monitoring (answering «WHAT?»)
• Sampling mode
– if (event_happened)
if (++counter < threshold) INTERRUPT;
– profiling (answering «WHERE?»)
28/66
HWC: Issues
So many events, so few counters
e.g. “Nehalem”:
– over 900 events
– 7 HWC (3 fixed + 4 programmable)
• «multiple-running» (different events)
– Repeatability
• «multiplexing» (only a few tools are able to do that)
– Steady state
29/66
HWC: Issues (cont.)
Sampling mode:
• «instruction skid»
(hard to correlate event and instruction)
• Uncore events
(hard to bind event and execution thread;
e.g. shared L3 cache)
30/66
HWC: typical usages
• hardware validation
• performance analysis
• run-time tuning (e.g. JRockit, etc.)
• security attacks and defenses
• test code coverage
• etc.
31/66
HWC: tools
• Oracle Solaris Studio Performance Analyzer
http://www.oracle.com/technetwork/server-storage/solarisstudio
• perf/perf_events
http://perf.wiki.kernel.org
• JMH (Java Microbenchmark Harness)
http://openjdk.java.net/projects/code-tools/jmh/
-prof perf
-prof perfnorm = perf, normalized per operation
-prof perfasm = perf + -XX:+PrintAssembly
32/66
HWC: tools (cont.)
• AMD CodeXL
• Intel® Vtune™ Amplifier
• etc. . .
33/66
perf events (e.g.)
• cycles
• instructions
• cache-references
• cache-misses
• branches
• branch-misses
• bus-cycles
• ref-cycles
• dTLB-loads
• dTLB-load-misses
• L1-dcache-loads
• L1-dcache-load-misses
• L1-dcache-stores
• L1-dcache-store-misses
• LLC-loads
• etc...
34/66
Oracle Studio events (e.g.)
• cycles
• insts
• branch-instruction-retired
• branch-misses-retired
• dtlbm
• l1h
• l1m
• l2h
• l2m
• l3h
• l3m
• etc...
35/66
Chapter 4
I’ve got HWC data. What’s next?
or
Introduction to microarchitecture
performance analysis
36/66
Execution time
tme =
cyces
ƒreqency
Optimization is. . .
reducing the (spent) cycle count!
everything else is overclocking
37/66
Microarchitecture Equation
cyces = PthLength ∗ CP = PthLength ∗ 1
PC
• PthLength - number of instructions
• CP - cycles per instruction
• PC - instructions per cycle
38/66
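Plugging in the whole-run counters from the perf stat output later in the deck (2277 × 10⁹ cycles, 1226 × 10⁹ instructions) reproduces the CPI that perf reports; a throwaway sketch of the arithmetic (class name is illustrative):

```java
public class CpiMath {
    // CPI = cycles / instructions; IPC is its reciprocal.
    static double cpi(long cycles, long instructions) {
        return (double) cycles / instructions;
    }

    public static void main(String[] args) {
        // Whole-run counters from Example 2, in units of 10⁹ events.
        double cpi = cpi(2_277L, 1_226L);
        // prints CPI = 1.86, IPC = 0.54 (perf reported "0.54 insns per cycle")
        System.out.printf("CPI = %.2f, IPC = %.2f%n", cpi, 1 / cpi);
    }
}
```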
PthLength ∗ CP
• PthLength ∼ algorithm efficiency (the smaller the better)
• CP ∼ CPU efficiency (the smaller the better)
– CP = 4 – bad!
– CP = 1
• Nehalem – just good enough!
• SandyBridge and later – not so good!
– CP = 0.4 – good!
– CP = 0.2 – ideal!
39/66
What to do?
• low CPI
– reduce PathLength → «tune algorithm»
• high CPI ⇒ CPU stalls
– memory stalls → «tune data structures»
– branch stalls → «tune control logic»
– instruction dependency → «break dependency chains»
– long latency ops → «use more simple operations»
– etc. . .
40/66
High CPI: Memory bound
• dTLB misses
• L1,L2,L3,...,LN misses
• NUMA: non-local memory access
• memory bandwidth
• false/true sharing
• cache line split (not in Java world, except. . . )
• store forwarding (unlikely, hard to fix on Java level)
• 4K aliasing
41/66
High CPI: Core bound
• long latency operations: DIV, SQRT
• FP assist: floating-point denormals, NaN, inf
• bad speculation (caused by mispredicted branch)
• port saturation (forget it)
42/66
High CPI: Front-End bound
• iTLB miss
• iCache miss
• branch mispredict
• LSD (loop stream decoder)
solvable by HotSpot tweaking
43/66
Example 2
Some «large» standard server-side
Java benchmark
44/66
Example 2: perf stat -d
880851.993237 task-clock (msec)
39,318 context-switches
437 cpu-migrations
7,931 page-faults
2,277,063,113,376 cycles
1,226,299,634,877 instructions # 0.54 insns per cycle
229,500,265,931 branches # 260.544 M/sec
4,620,666,169 branch-misses # 2.01% of all branches
338,169,489,902 L1-dcache-loads # 383.912 M/sec
37,937,596,505 L1-dcache-load-misses # 11.22% of all L1-dcache hits
25,232,434,666 LLC-loads # 28.645 M/sec
7,307,884,874 L1-icache-load-misses # 0.00% of all L1-icache hits
337,730,278,697 dTLB-loads # 382.846 M/sec
6,094,356,801 dTLB-load-misses # 1.81% of all dTLB cache hits
12,210,841,909 iTLB-loads # 13.863 M/sec
431,803,270 iTLB-load-misses # 3.54% of all iTLB cache hits
301.557213044 seconds time elapsed
45/66
Example 2: Processed perf stat -d
CPI                          1.85
cycles ×10⁹                  2277
instructions ×10⁹            1226
L1-dcache-loads ×10⁹         338
L1-dcache-load-misses ×10⁹   38
LLC-loads ×10⁹               25
dTLB-loads ×10⁹              338
dTLB-load-misses ×10⁹        6
Potential Issues?
Let’s count, with cache latencies (L1/L2/L3) = 4/12/36 cycles:
4 × (338 × 10⁹) + 12 × ((38 − 25) × 10⁹) + 36 × (25 × 10⁹) ≈ 2408 × 10⁹ cycles ???
Superscalar in Action!
Issue: the dTLB misses!
46/66
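The back-of-the-envelope count above simply charges every load the full latency of the cache level that served it; a quick sketch of that arithmetic (counts in units of 10⁹, names illustrative):

```java
public class StallBudget {
    // Naive serial-latency estimate: each load is charged the full
    // latency of the level that served it (L1/L2/L3 = 4/12/36 cycles).
    public static void main(String[] args) {
        long l1Loads = 338, l1Misses = 38, llcLoads = 25; // ×10⁹, from perf stat -d
        long cycles = 4 * l1Loads + 12 * (l1Misses - llcLoads) + 36 * llcLoads;
        System.out.println(cycles + " × 10⁹ cycles"); // 2408 — more than the 2277 measured
    }
}
```

The estimate exceeds the 2277 × 10⁹ cycles actually spent because a superscalar, out-of-order core overlaps many of those loads; that is the «Superscalar in Action!» remark above.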
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
• a memory cache that stores recent translations of virtual
memory addresses to physical addresses
• each memory access → access to TLB
• TLB miss may take hundreds of cycles
How to check?
• dtlb_load_misses_miss_causes_a_walk
• dtlb_load_misses_walk_duration
47/66
Example 2: dTLB misses
CPI                          1.85
cycles ×10⁹                  2277
instructions ×10⁹            1226
L1-dcache-loads ×10⁹         338
L1-dcache-load-misses ×10⁹   38
LLC-loads ×10⁹               25
dTLB-loads ×10⁹              338
dTLB-load-misses ×10⁹        6
dTLB-walks-duration ×10⁹     296 (in cycles)
≈ 13% of all cycles (time)
48/66
Example 2: dTLB misses
Fixing it:
• Try to shrink the application working set
(modify your application)
• Enable -XX:+UseLargePages
(modify your execution scripts)
20% boost on the benchmark!
49/66
Example 2: -XX:+UseLargePages
                             baseline   large pages
CPI                          1.85       1.56
cycles ×10⁹                  2277       2277
instructions ×10⁹            1226       1460
L1-dcache-loads ×10⁹         338        401
L1-dcache-load-misses ×10⁹   38         38
LLC-loads ×10⁹               25         25
dTLB-loads ×10⁹              338        401
dTLB-load-misses ×10⁹        6          0.24
dTLB-walks-duration ×10⁹     296        2.6
50/66
Example 2: normalized per transaction
                             baseline   large pages
CPI                          1.85       1.56
cycles ×10⁶                  23.4       19.5
instructions ×10⁶            12.5       12.5
L1-dcache-loads ×10⁶         3.45       3.45
L1-dcache-load-misses ×10⁶   0.39       0.33
LLC-loads ×10⁶               0.26       0.21
dTLB-loads ×10⁶              3.45       3.45
dTLB-load-misses ×10⁶        0.06       0.002
dTLB-walks-duration ×10⁶     3.04       0.022
51/66
Example 3
«A plague on both your houses»
«++ on both your threads»
52/66
Example 3: False Sharing
• False Sharing:
https://en.wikipedia.org/wiki/False_sharing
• @Contended (JEP 142)
http://openjdk.java.net/jeps/142
53/66
Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
int field0;
int field1;
}
@Benchmark
@Group("baseline")
public int justDoIt(StateBaseline s) {
return s.field0++;
}
@Benchmark
@Group("baseline")
public int doItFromOtherThread(StateBaseline s) {
return s.field1++;
}
54/66
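JMH aside, the effect is easy to reproduce with two plain threads bumping adjacent fields. A sketch (names are illustrative; the @Contended fix from JEP 142 is emulated here with manual padding, and the timing gap, not the final values, is what the HWC data explains):

```java
public class FalseSharingDemo {
    // Two counters that almost certainly land on the same cache line.
    static class Shared {
        volatile long field0;
        volatile long field1;
    }

    // Manual padding pushes field1 onto a different cache line,
    // emulating what @Contended (JEP 142) does automatically.
    static class Padded {
        volatile long field0;
        long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
        volatile long field1;
    }

    static long run(Runnable a, Runnable b) throws InterruptedException {
        Thread t1 = new Thread(a), t2 = new Thread(b);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 10_000_000;
        Shared s = new Shared();
        long sharingNs = run(() -> { for (int i = 0; i < N; i++) s.field0++; },
                             () -> { for (int i = 0; i < N; i++) s.field1++; });
        Padded p = new Padded();
        long paddedNs = run(() -> { for (int i = 0; i < N; i++) p.field0++; },
                            () -> { for (int i = 0; i < N; i++) p.field1++; });
        // Counts are exact (each field has a single writer); only the time differs.
        System.out.printf("sharing: %d ms, padded: %d ms%n",
                sharingNs / 1_000_000, paddedNs / 1_000_000);
    }
}
```

The measured gap depends on where the OS schedules the two threads, which is exactly the Same Core / Diff Cores / Diff Sockets breakdown in the table that follows.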
Example 3: Measure
                            shared resource   sharing   padded
Same Core (HT)              L1                9.5       4.9
Diff Cores (within socket)  L3 (LLC)          10.6      2.8
Diff Sockets                nothing           18.2      2.8
average time, ns/op
55/66
Example 3: What do HWC tell us?
Same Core Diff Cores Diff Sockets
sharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4
cycles 33130 17536 36012 9163 46484 9608
instructions 26418 25865 26550 25747 26717 25768
L1-loads 12593 9467 9696 8973 9672 9016
L1-load-misses 10 5 12 4 33 3
L1-stores 4317 7838 7433 4069 6935 4074
L1-store-misses 5 2 161 2 55 1
LLC-loads 4 3 58 1 32 1
LLC-load-misses 1 1 53 ≈ 0 35 ≈ 0
LLC-stores 1 1 183 ≈ 0 49 ≈ 0
LLC-store-misses 1 ≈ 0 182 ≈ 0 48 ≈ 0
All values are normalized per 103 operations
56/66
Example 3: «on a core and a prayer»
in the single-core (HT) case we have to look at
MACHINE_CLEARS.MEMORY_ORDERING
Same Core Diff Cores Diff Sockets
sharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4
cycles 33130 17536 36012 9163 46484 9608
instructions 26418 25865 26550 25747 26717 25768
CLEARS 238 ≈ 0 ≈ 0 ≈ 0 ≈ 0 ≈ 0
57/66
Example 3: Diff Cores
L2_STORE_LOCK_RQSTS - L2 RFOs breakdown
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
LLC-stores 183 ≈ 0 49 ≈ 0
LLC-store-misses 182 ≈ 0 48 ≈ 0
L2_STORE_LOCK_RQSTS.MISS 134 ≈ 0 33 ≈ 0
L2_STORE_LOCK_RQSTS.HIT_E ≈ 0 ≈ 0 ≈ 0 ≈ 0
L2_STORE_LOCK_RQSTS.HIT_M ≈ 0 ≈ 0 ≈ 0 ≈ 0
L2_STORE_LOCK_RQSTS.ALL 183 ≈ 0 49 ≈ 0
58/66
Question!
To the audience
59/66
Example 3: Diff Cores
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
LLC-stores 183 ≈ 0 49 ≈ 0
LLC-store-misses 182 ≈ 0 48 ≈ 0
L2_STORE_LOCK_RQSTS.MISS 134 ≈ 0 33 ≈ 0
L2_STORE_LOCK_RQSTS.ALL 183 ≈ 0 49 ≈ 0
Why 183 > 49 and 134 > 33,
yet the same-socket case is faster?
60/66
Example 3: Some events count duration
For example:
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
cycles 36012 9163 46484 9608
instructions 26550 25747 26717 25768
O_R_O.CYCLES_W_D_RFO 21723 10 29601 56
61/66
Summary
Summary: "Performance is easy"
To achieve high performance:
• You have to know your Application!
• You have to know your Frameworks!
• You have to know your Virtual Machine!
• You have to know your Operating System!
• You have to know your Hardware!
63/66
Enlarge your knowledge with these simple tricks!
Reading list:
• “Computer Architecture: A Quantitative Approach”
John L. Hennessy, David A. Patterson
• CPU vendors documentation
• http://www.agner.org/optimize/
• http://www.google.com/search?q=Hardware+performance+counter
• etc. . .
64/66
Thanks!
65/66
Q & A ?
66/66
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
pmu-hwc-java

More Related Content

PDF
Java Performance: Speedup your application with hardware counters
PDF
"Quantum" performance effects
PDF
"Quantum" Performance Effects
PDF
JDK8: Stream style
PPTX
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
PDF
Facebook Glow Compiler のソースコードをグダグダ語る会
PDF
JVM Mechanics
PDF
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Java Performance: Speedup your application with hardware counters
"Quantum" performance effects
"Quantum" Performance Effects
JDK8: Stream style
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Facebook Glow Compiler のソースコードをグダグダ語る会
JVM Mechanics
Bridge TensorFlow to run on Intel nGraph backends (v0.5)

What's hot (20)

PDF
Java9を迎えた今こそ!Java本格(再)入門
PDF
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
PPT
Troubleshooting Linux Kernel Modules And Device Drivers
PDF
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
PDF
Performance Wins with eBPF: Getting Started (2021)
PPT
Threaded Programming
PPTX
Machine Learning Model Bakeoff
PPTX
Ember
PDF
Global Interpreter Lock: Episode I - Break the Seal
PDF
Joel Falcou, Boost.SIMD
PDF
Profiling your Applications using the Linux Perf Tools
PDF
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
PDF
TVM VTA (TSIM)
PPTX
Patching: answers to questions you probably were afraid to ask about oracle s...
PPTX
Return Oriented Programming (ROP) Based Exploits - Part I
PDF
Non-blocking synchronization — what is it and why we (don't?) need it
PDF
YOW2020 Linux Systems Performance
PPT
How Many Slaves (Ukoug)
PDF
Multiply your Testing Effectiveness with Parameterized Testing, v1
PPT
DTrace - Miracle Scotland Database Forum
Java9を迎えた今こそ!Java本格(再)入門
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Troubleshooting Linux Kernel Modules And Device Drivers
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
Performance Wins with eBPF: Getting Started (2021)
Threaded Programming
Machine Learning Model Bakeoff
Ember
Global Interpreter Lock: Episode I - Break the Seal
Joel Falcou, Boost.SIMD
Profiling your Applications using the Linux Perf Tools
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
TVM VTA (TSIM)
Patching: answers to questions you probably were afraid to ask about oracle s...
Return Oriented Programming (ROP) Based Exploits - Part I
Non-blocking synchronization — what is it and why we (don't?) need it
YOW2020 Linux Systems Performance
How Many Slaves (Ukoug)
Multiply your Testing Effectiveness with Parameterized Testing, v1
DTrace - Miracle Scotland Database Forum
Ad

Similar to Speedup Your Java Apps with Hardware Counters (20)

PDF
Caching in (DevoxxUK 2013)
PDF
Caching in
PDF
Performance and predictability (1)
PDF
Performance and Predictability - Richard Warburton
PPTX
Computer Architecture and Organization
PPTX
Use Data-Oriented Design to write efficient code
PDF
Performance and predictability
PDF
PDF
PPU Optimisation Lesson
PDF
Performance and predictability
PDF
Parallel Application Performance Prediction of Using Analysis Based Modeling
PPTX
Code and memory optimization tricks
PPTX
Code and Memory Optimisation Tricks
PDF
Introduction to Java Profiling
PPTX
Go Native : Squeeze the juice out of your 64-bit processor using C++
PPTX
Java/Scala Lab 2016. Сергей Моренец: Способы повышения эффективности в Java 8.
PDF
L06-handout.pdf
PDF
Caching in
PDF
Performance and memory profiling for embedded system design
PDF
This is Unit 1 of High Performance Computing For SRM students
Caching in (DevoxxUK 2013)
Caching in
Performance and predictability (1)
Performance and Predictability - Richard Warburton
Computer Architecture and Organization
Use Data-Oriented Design to write efficient code
Performance and predictability
PPU Optimisation Lesson
Performance and predictability
Parallel Application Performance Prediction of Using Analysis Based Modeling
Code and memory optimization tricks
Code and Memory Optimisation Tricks
Introduction to Java Profiling
Go Native : Squeeze the juice out of your 64-bit processor using C++
Java/Scala Lab 2016. Сергей Моренец: Способы повышения эффективности в Java 8.
L06-handout.pdf
Caching in
Performance and memory profiling for embedded system design
This is Unit 1 of High Performance Computing For SRM students
Ad

More from C4Media (20)

PDF
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
PDF
Next Generation Client APIs in Envoy Mobile
PDF
Software Teams and Teamwork Trends Report Q1 2020
PDF
Understand the Trade-offs Using Compilers for Java Applications
PDF
Kafka Needs No Keeper
PDF
High Performing Teams Act Like Owners
PDF
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
PDF
Service Meshes- The Ultimate Guide
PDF
Shifting Left with Cloud Native CI/CD
PDF
CI/CD for Machine Learning
PDF
Fault Tolerance at Speed
PDF
Architectures That Scale Deep - Regaining Control in Deep Systems
PDF
ML in the Browser: Interactive Experiences with Tensorflow.js
PDF
Build Your Own WebAssembly Compiler
PDF
User & Device Identity for Microservices @ Netflix Scale
PDF
Scaling Patterns for Netflix's Edge
PDF
Make Your Electron App Feel at Home Everywhere
PDF
The Talk You've Been Await-ing For
PDF
Future of Data Engineering
PDF
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Next Generation Client APIs in Envoy Mobile
Software Teams and Teamwork Trends Report Q1 2020
Understand the Trade-offs Using Compilers for Java Applications
Kafka Needs No Keeper
High Performing Teams Act Like Owners
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Service Meshes- The Ultimate Guide
Shifting Left with Cloud Native CI/CD
CI/CD for Machine Learning
Fault Tolerance at Speed
Architectures That Scale Deep - Regaining Control in Deep Systems
ML in the Browser: Interactive Experiences with Tensorflow.js
Build Your Own WebAssembly Compiler
User & Device Identity for Microservices @ Netflix Scale
Scaling Patterns for Netflix's Edge
Make Your Electron App Feel at Home Everywhere
The Talk You've Been Await-ing For
Future of Data Engineering
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More

Recently uploaded (20)

PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
KodekX | Application Modernization Development
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Cloud computing and distributed systems.
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Electronic commerce courselecture one. Pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Mobile App Security Testing_ A Comprehensive Guide.pdf
KodekX | Application Modernization Development
Unlocking AI with Model Context Protocol (MCP)
Chapter 3 Spatial Domain Image Processing.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
Big Data Technologies - Introduction.pptx
Cloud computing and distributed systems.
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
20250228 LYD VKU AI Blended-Learning.pptx
Electronic commerce courselecture one. Pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Programs and apps: productivity, graphics, security and other tools
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Dropbox Q2 2025 Financial Results & Investor Presentation
Diabetes mellitus diagnosis method based random forest with bat algorithm
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf

Speedup Your Java Apps with Hardware Counters

  • 1. Java Performance: Speedup your applications with hardware counters Sergey Kuksenko sergey.kuksenko@oracle.com, @kuksenk0
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ pmu-hwc-java
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. Copyright © 2016, Oracle and/or its affiliates. All rights reserved. 2/66
  • 5. Intriguing Introductory Example Gotcha, here is my hot method or On Algorithmic O ptimizations just-O 3/66
  • 6. Example 1: Very Hot Code public int[][] multiply(int[][] A, int[][] B) { int size = A.length; int[][] R = new int[size][size]; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return R; } 4/66
  • 7. Example 1: Very Hot Code public int[][] multiply(int[][] A, int[][] B) { int size = A.length; int[][] R = new int[size][size]; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return R; } O (N3) big-O 4/66
  • 8. Example 1: "Ok Google" O(N2.81) 5/66
  • 9. Example 1: Very Hot Code N multiply strassen 128 0.0026 0.0023 512 0.48 0.12 2048 28 5.8 8192 3571 282 time; seconds/op 6/66
  • 10. Example 1: Optimized speedup over baseline 7/66
  • 11. Try the other way 8/66
  • 12. Example 1: JMH, -prof perfnorm, N=256 per multiplication per iteration CPI 1.06 cycles 146 × 106 8.7 instructions 137 × 106 8.2 L1-loads 68 × 106 4 L1-load-misses 33 × 106 2 L1-stores 0.56 × 106 L1-store-misses 38 × 103 LLC-loads 26 × 106 1.6 LLC-stores 11.6 × 103 ∼ 256Kb per matrix 9/66
  • 13. Example 1: Very Hot Code public int[][] multiply(int[][] A, int[][] B) { int size = A.length; int[][] R = new int[size][size]; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return R; } L1-load-misses 10/66
  • 14. Example 1: Very Hot Code public int[][] multiply(int[][] A, int[][] B) { int size = A.length; int[][] R = new int[size][size]; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return R; } 10/66
  • 15. Example 1: Very Hot Code
public int[][] multiplyIKJ(int[][] A, int[][] B) {
    int size = A.length;
    int[][] R = new int[size][size];
    for (int i = 0; i < size; i++) {
        for (int k = 0; k < size; k++) {
            int aik = A[i][k];
            for (int j = 0; j < size; j++) {
                R[i][j] += aik * B[k][j];
            }
        }
    }
    return R;
}
11/66
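The loop interchange on this slide is easy to verify in isolation. A minimal, self-contained sketch (plain Java, not the talk's JMH harness; the class name LoopOrderDemo is ours):

```java
import java.util.Arrays;
import java.util.Random;

public class LoopOrderDemo {
    // Classic IJK order: the inner loop reads B[k][j] column-wise,
    // missing L1 on almost every access for large matrices.
    static int[][] multiplyIJK(int[][] A, int[][] B) {
        int n = A.length;
        int[][] R = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                int s = 0;
                for (int k = 0; k < n; k++) s += A[i][k] * B[k][j];
                R[i][j] = s;
            }
        return R;
    }

    // IKJ order: the inner loop walks rows B[k] and R[i] sequentially,
    // which is cache-friendly and lets the JIT vectorize it.
    static int[][] multiplyIKJ(int[][] A, int[][] B) {
        int n = A.length;
        int[][] R = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                int aik = A[i][k];
                for (int j = 0; j < n; j++) R[i][j] += aik * B[k][j];
            }
        return R;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 64;
        int[][] a = new int[n][n], b = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = rnd.nextInt(100);
                b[i][j] = rnd.nextInt(100);
            }
        // Both orders compute the same product; only the memory traffic differs.
        System.out.println(Arrays.deepEquals(multiplyIJK(a, b), multiplyIKJ(a, b)));
    }
}
```

The transformation changes nothing semantically, which is why it is so cheap: the speedup comes purely from the memory-access pattern.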
  • 16. Example 1: Very Hot Code — IKJ multiply timings added; the table is completed on the next slide 12/66
  • 17. Example 1: Very Hot Code (time; seconds/op)
        IJK                   IKJ
N       multiply   strassen   multiply   strassen
128     0.0026     0.0023     0.0005
512     0.48       0.12       0.03       0.02
2048    28         5.8        2          1.5
8192    3571       282        264        79
12/66
  • 18. Example 1: Optimized speedup over baseline 13/66
  • 19. Example 1: JMH, -prof perfnorm, N=256
                   IJK                        IKJ
                   multiply      iteration    multiply      iteration
CPI                1.06                       0.51
cycles             146 × 10^6    8.7          9.7 × 10^6    0.6
instructions       137 × 10^6    8.2          19 × 10^6     1.1
L1-loads           68 × 10^6     4            5.4 × 10^6    0.3
L1-load-misses     33 × 10^6     2            1.1 × 10^6    0.1
L1-stores          0.56 × 10^6               2.7 × 10^6    0.2
L1-store-misses    38 × 10^3                  9 × 10^3
LLC-loads          26 × 10^6     1.6          0.3 × 10^6
LLC-stores         11.6 × 10^3                3.5 × 10^3
14/66
  • 20. Example 1: the same table, posing the question — why is IKJ's CPI 2× better? 14/66
  • 21. Example 1: Free beer benefits!
cycles   insts
         ...
 0.15%    0.17%  <addr>: vmovdqu 0x10(%rax,%rdx,4),%ymm5
 8.83%    9.02%          vpmulld %ymm4,%ymm5,%ymm5
43.60%   42.03%          vpaddd 0x10(%r9,%rdx,4),%ymm5,%ymm5
 6.26%    6.24%          vmovdqu %ymm5,0x10(%r9,%rdx,4)   ;*iastore
19.82%   20.82%          add $0x8,%edx                    ;*iinc
 1.46%    2.75%          cmp %ecx,%edx
                         jl <addr>                        ;*if_icmpge
         ...
how to get asm: JMH, -prof perfasm 15/66
  • 22. Example 1: Free beer benefits! — the same listing, annotated: Vectorization (SSE/AVX)! 15/66
  • 23. Chapter 1 Performance Optimization Methodology: A Very Short Introduction. Three magic questions and the direction of the journey 16/66
  • 25. Three magic questions What? • What prevents my application from working faster? (monitoring) 17/66
  • 26. Three magic questions What? ⇒ Where? • What prevents my application from working faster? (monitoring) • Where does it hide? (profiling) 17/66
  • 27. Three magic questions What? ⇒ Where? ⇒ How? • What prevents my application from working faster? (monitoring) • Where does it hide? (profiling) • How to stop it messing with performance? (tuning/optimizing) 17/66
  • 28. Top-Down Approach • System Level – Network, Disk, OS, CPU/Memory • JVM Level – GC/Heap, JIT, Classloading • Application Level – Algorithms, Synchronization, Threading, API • Microarchitecture Level – Caches, Data/Code alignment, CPU Pipeline Stalls 18/66
  • 31. Let’s go (Top-Down) Do system monitoring (e.g. mpstat) 20/66
  • 32. Let’s go (Top-Down) Do system monitoring (e.g. mpstat) • Lots of %sys ⇒ . . . ⇓ 20/66
  • 33. Let’s go (Top-Down) Do system monitoring (e.g. mpstat) • Lots of %sys ⇒ . . . ⇓ • Lots of %irq, %soft ⇒ . . . ⇓ 20/66
  • 34. Let’s go (Top-Down) Do system monitoring (e.g. mpstat) • Lots of %sys ⇒ . . . ⇓ • Lots of %irq, %soft ⇒ . . . ⇓ • Lots of %iowait ⇒ . . . ⇓ 20/66
  • 35. Let’s go (Top-Down) Do system monitoring (e.g. mpstat) • Lots of %sys ⇒ . . . ⇓ • Lots of %irq, %soft ⇒ . . . ⇓ • Lots of %iowait ⇒ . . . ⇓ • Lots of %idle ⇒ . . . ⇓ 20/66
  • 36. Let’s go (Top-Down) Do system monitoring (e.g. mpstat) • Lots of %sys ⇒ . . . ⇓ • Lots of %irq, %soft ⇒ . . . ⇓ • Lots of %iowait ⇒ . . . ⇓ • Lots of %idle ⇒ . . . ⇓ • Lots of %user 20/66
  • 37. Chapter 2 High CPU Load and the main question: «who/what is to blame?» 21/66
  • 38. CPU Utilization • What does ∼100% CPU Utilization mean? 22/66
  • 39. CPU Utilization • What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule 22/66
  • 40. CPU Utilization • What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule Can profiling help? 22/66
  • 41. CPU Utilization • What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule Can profiling help? A traditional profiler shows «WHERE» application time is spent, but gives no answer to the question «WHY». 22/66
  • 42. Who/What is to blame? Complex CPU microarchitecture: • Inefficient algorithm ⇒ 100% CPU • Pipeline stall due to memory load/stores ⇒ 100% CPU • Pipeline flush due to mispredicted branch ⇒ 100% CPU • Expensive instructions ⇒ 100% CPU • Insufficient ILP ⇒ 100% CPU • etc. ⇒ 100% CPU Instruction Level Parallelism 23/66
  • 44. PMU: Performance Monitoring Unit Performance Monitoring Unit - profiles hardware activity, built into CPU. 25/66
  • 45. PMU: Performance Monitoring Unit Performance Monitoring Unit - profiles hardware activity, built into CPU. PMU Internals (in less than 21 seconds) : Hardware counters (HWC) count hardware performance events (performance monitoring events) 25/66
  • 46. Events Vendor’s documentation! (e.g. 2 pages from 32) 26/66
  • 47. Events: Issues • Hundreds of events • Microarchitectural experience is required • Platform dependent – vary from CPU vendor to vendor – may vary when CPU manufacturer introduces new microarchitecture • How to work with HWC? 27/66
  • 48. HWC HWC modes: • Counting mode – if (event_happened) ++counter; – general monitoring (answering «WHAT?» ) • Sampling mode – if (event_happened) if (++counter < threshold) INTERRUPT; – profiling (answering «WHERE?») 28/66
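The two pseudocode fragments on this slide can be sketched as a software simulation (conceptual only — real HWCs do this in silicon; the class and field names are ours):

```java
public class HwcModes {
    long counter;                 // the hardware counter register, simulated
    final long threshold;         // sampling period
    final Runnable onInterrupt;   // "profiler" interrupt handler

    HwcModes(long threshold, Runnable onInterrupt) {
        this.threshold = threshold;
        this.onInterrupt = onInterrupt;
    }

    // Counting mode: just accumulate; the total is read at the end
    // of the run (answers "WHAT?").
    void countEvent() { counter++; }

    // Sampling mode: every 'threshold' events an interrupt fires and
    // the handler records where execution currently is (answers "WHERE?").
    void sampleEvent() {
        if (++counter >= threshold) {
            counter = 0;
            onInterrupt.run();
        }
    }
}
```

The trade-off is visible in the sketch: counting is essentially free but positionless; sampling attributes events to code locations at the cost of periodic interrupts.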
  • 49. HWC: Issues So many events, so few counters e.g. “Nehalem”: – over 900 events – 7 HWC (3 fixed + 4 programmable) • «multiple-running» (different events) – Repeatability • «multiplexing» (only a few tools are able to do that) – Steady state 29/66
  • 50. HWC: Issues (cont.) Sampling mode: • «instruction skid» (hard to correlate event and instruction) • Uncore events (hard to bind event and execution thread; e.g. shared L3 cache) 30/66
  • 51. HWC: typical usages • hardware validation • performance analysis 31/66
  • 52. HWC: typical usages • hardware validation • performance analysis • run-time tuning (e.g. JRockit, etc.) • security attacks and defenses • test code coverage • etc. 31/66
  • 53. HWC: tools • Oracle Solaris Studio Performance Analyzer http://www.oracle.com/technetwork/server-storage/solarisstudio • perf/perf_events http://perf.wiki.kernel.org • JMH (Java Microbenchmark Harness) http://openjdk.java.net/projects/code-tools/jmh/) -prof perf -prof perfnorm = perf, normalized per operation -prof perfasm = perf + -XX:+PrintAssembly 32/66
  • 54. HWC: tools (cont.) • AMD CodeXL • Intel® Vtune™ Amplifier • etc. . . 33/66
  • 55. perf events (e.g.) • cycles • instructions • cache-references • cache-misses • branches • branch-misses • bus-cycles • ref-cycles • dTLB-loads • dTLB-load-misses • L1-dcache-loads • L1-dcache-load-misses • L1-dcache-stores • L1-dcache-store-misses • LLC-loads • etc... 34/66
  • 56. Oracle Studio events (e.g.) • cycles • insts • branch-instruction-retired • branch-misses-retired • dtlbm • l1h • l1m • l2h • l2m • l3h • l3m • etc... 35/66
  • 57. Chapter 4 I’ve got HWC data. What’s next? or Introduction to microarchitecture performance analysis 36/66
  • 58. Execution time: time = cycles / frequency. Optimization is. . . reducing the (spent) cycle count! Everything else is overclocking. 37/66
  • 59. Microarchitecture Equation: cycles = PathLength × CPI = PathLength × (1 / IPC) • PathLength – number of instructions • CPI – cycles per instruction • IPC – instructions per cycle 38/66
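Plugging the Example 1 numbers from the earlier perfnorm slides into this equation (a sketch; the constants are the measured values, the class name is ours):

```java
public class CpiMath {
    // CPI = cycles / instructions; IPC is its reciprocal.
    static double cpi(double cycles, double instructions) {
        return cycles / instructions;
    }

    public static void main(String[] args) {
        // IJK multiply, N=256: 146e6 cycles over 137e6 instructions -> CPI ~ 1.06
        System.out.printf("IJK CPI = %.2f%n", cpi(146e6, 137e6));
        // IKJ multiply, N=256: 9.7e6 cycles over 19e6 instructions -> CPI ~ 0.51
        System.out.printf("IKJ CPI = %.2f%n", cpi(9.7e6, 19e6));
    }
}
```

Note the two levers are independent: IKJ both shrank PathLength (fewer cycles overall) and halved CPI (better CPU efficiency per instruction).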
  • 60. PthLength ∗ CP • PthLength ∼ algorithm efficiency (the smaller the better) • CP ∼ CPU efficiency (the smaller the better) – CP = 4 – bad! – CP = 1 • Nehalem – just good enough! • SandyBridge and later – not so good! – CP = 0.4 – good! – CP = 0.2 – ideal! 39/66
  • 61. What to do? • low CPI – reduce PathLength → «tune algorithm» • high CPI ⇒ CPU stalls – memory stalls → «tune data structures» – branch stalls → «tune control logic» – instruction dependency → «break dependency chains» – long latency ops → «use simpler operations» – etc. . . 40/66
  • 62. High CPI: Memory bound • dTLB misses • L1,L2,L3,...,LN misses • NUMA: non-local memory access • memory bandwidth • false/true sharing • cache line split (not in Java world, except. . . ) • store forwarding (unlikely, hard to fix on Java level) • 4K aliasing 41/66
  • 63. High CPI: Core bound • long latency operations: DIV, SQRT • FP assist: floating-point denormals, NaN, inf • bad speculation (caused by a mispredicted branch) • port saturation (forget it) 42/66
  • 64. High CPI: Front-End bound • iTLB miss • iCache miss • branch mispredict • LSD (loop stream decoder) solvable by HotSpot tweaking 43/66
  • 65. Example 2 Some «large» standard server-side Java benchmark 44/66
  • 66. Example 2: perf stat -d
    880851.993237 task-clock (msec)
           39,318 context-switches
              437 cpu-migrations
            7,931 page-faults
2,277,063,113,376 cycles
1,226,299,634,877 instructions           # 0.54 insns per cycle
  229,500,265,931 branches               # 260.544 M/sec
    4,620,666,169 branch-misses          # 2.01% of all branches
  338,169,489,902 L1-dcache-loads        # 383.912 M/sec
   37,937,596,505 L1-dcache-load-misses  # 11.22% of all L1-dcache hits
   25,232,434,666 LLC-loads              # 28.645 M/sec
    7,307,884,874 L1-icache-load-misses  # 0.00% of all L1-icache hits
  337,730,278,697 dTLB-loads             # 382.846 M/sec
    6,094,356,801 dTLB-load-misses       # 1.81% of all dTLB cache hits
   12,210,841,909 iTLB-loads             # 13.863 M/sec
      431,803,270 iTLB-load-misses       # 3.54% of all iTLB cache hits
301.557213044 seconds time elapsed
45/66
  • 67. Example 2: Processed perf stat -d
CPI                              1.85
cycles                × 10^9     2277
instructions          × 10^9     1226
L1-dcache-loads       × 10^9     338
L1-dcache-load-misses × 10^9     38
LLC-loads             × 10^9     25
dTLB-loads            × 10^9     338
dTLB-load-misses      × 10^9     6
46/66
  • 68. The same table — Potential Issues? 46/66
  • 69. Let’s count, using cache latencies L1/L2/L3 = 4/12/36 cycles: 4 × (338 × 10^9) + 12 × ((38 − 25) × 10^9) + 36 × (25 × 10^9) ≈ 2408 × 10^9 cycles — more than the 2277 × 10^9 actually measured??? 46/66
  • 70. Superscalar in Action! (the core overlaps those load latencies, so the cache misses are not the bottleneck here) 46/66
  • 71. The real Issue: dTLB-load-misses 46/66
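The back-of-envelope calculation on slide 69 can be reproduced directly (latencies 4/12/36 cycles as stated there; counter values in units of 10^9; class name ours):

```java
public class MemStallEstimate {
    // Naive serial estimate: every load is charged its full cache-level
    // latency. L1 hit = 4 cycles, L2 hit = 12, L3 hit = 36.
    static double estimateCycles(double l1Loads, double l1Misses, double llcLoads) {
        double l2Hits = l1Misses - llcLoads; // L1 misses satisfied by L2
        return 4 * l1Loads + 12 * l2Hits + 36 * llcLoads;
    }

    public static void main(String[] args) {
        // 338e9 L1 loads, 38e9 L1 misses, 25e9 LLC loads -> ~2408e9 cycles,
        // MORE than the 2277e9 measured: a superscalar core overlaps these
        // latencies, so a serial sum overestimates and loads are not the
        // bottleneck in this benchmark.
        System.out.printf("%.0f x 10^9 cycles%n", estimateCycles(338, 38, 25));
    }
}
```

The point of the exercise is the contradiction: when a worst-case serial estimate exceeds the measured cycle count, the counters being summed cannot be what dominates execution time.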
  • 72. Example 2: dTLB misses TLB = Translation Lookaside Buffer • a memory cache that stores recent translations of virtual memory addresses to physical addresses • each memory access → access to TLB • a TLB miss may take hundreds of cycles 47/66
  • 73. Example 2: dTLB misses TLB = Translation Lookaside Buffer • a memory cache that stores recent translations of virtual memory addresses to physical addresses • each memory access → access to TLB • a TLB miss may take hundreds of cycles How to check? • dtlb_load_misses_miss_causes_a_walk • dtlb_load_misses_walk_duration 47/66
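On Linux those two counters can be read with perf; a sketch (the java invocation is a placeholder for your workload, and raw event names vary by CPU model — check `perf list` on your machine):

```shell
# Measure how many cycles are spent in page walks alongside total cycles:
perf stat -e cycles \
          -e dtlb_load_misses.miss_causes_a_walk \
          -e dtlb_load_misses.walk_duration \
          java -jar your-benchmark.jar
```

Dividing walk_duration by cycles gives the fraction of runtime lost to address translation — the 13% figure on the next slide.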
  • 74. Example 2: dTLB misses
CPI                              1.85
cycles                × 10^9     2277
instructions          × 10^9     1226
L1-dcache-loads       × 10^9     338
L1-dcache-load-misses × 10^9     38
LLC-loads             × 10^9     25
dTLB-loads            × 10^9     338
dTLB-load-misses      × 10^9     6
dTLB-walks-duration   × 10^9     296 (measured in cycles)
⇒ 13% of cycles (time) spent in page walks 48/66
  • 75. Example 2: dTLB misses Fixing it: • Try to shrink the application working set (modify your application) • Enable -XX:+UseLargePages (modify your execution scripts) 49/66
  • 76. Example 2: dTLB misses Fixing it: • Enable -XX:+UseLargePages (modify your execution scripts) 20% boost on the benchmark! 49/66
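Enabling large pages needs both OS and JVM cooperation; a sketch for Linux (the page count and heap sizes are illustrative — tune them to your heap):

```shell
# Reserve 2MB huge pages at the OS level (10240 pages ~ 20GB here):
echo 10240 | sudo tee /proc/sys/vm/nr_hugepages

# Then let HotSpot map the Java heap onto them:
java -XX:+UseLargePages -Xms16g -Xmx16g -jar your-app.jar
```

Fewer, larger pages mean far fewer TLB entries are needed to cover the same heap, which is exactly why the dTLB-walk numbers collapse on the next slide.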
  • 77. Example 2: -XX:+UseLargePages
                                 baseline   large pages
CPI                              1.85       1.56
cycles                × 10^9     2277       2277
instructions          × 10^9     1226       1460
L1-dcache-loads       × 10^9     338        401
L1-dcache-load-misses × 10^9     38         38
LLC-loads             × 10^9     25         25
dTLB-loads            × 10^9     338        401
dTLB-load-misses      × 10^9     6          0.24
dTLB-walks-duration   × 10^9     296        2.6
50/66
  • 78. Example 2: normalized per transaction
                                 baseline   large pages
CPI                              1.85       1.56
cycles                × 10^6     23.4       19.5
instructions          × 10^6     12.5       12.5
L1-dcache-loads       × 10^6     3.45       3.45
L1-dcache-load-misses × 10^6     0.39       0.33
LLC-loads             × 10^6     0.26       0.21
dTLB-loads            × 10^6     3.45       3.45
dTLB-load-misses      × 10^6     0.06       0.002
dTLB-walks-duration   × 10^6     3.04       0.022
51/66
  • 79. Example 3 «A plague on both your houses» «++ on both your threads» 52/66
  • 80. Example 3: False Sharing • False Sharing: https://en.wikipedia.org/wiki/False_sharing • @Contended (JEP 142) http://openjdk.java.net/jeps/142 53/66
  • 81. Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
    int field0;
    int field1;
}
@Benchmark @Group("baseline")
public int justDoIt(StateBaseline s) {
    return s.field0++;
}
@Benchmark @Group("baseline")
public int doItFromOtherThread(StateBaseline s) {
    return s.field1++;
}
54/66
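The two counters above typically land on the same 64-byte cache line. The "padded" variant measured on the following slides separates them; a plain-Java manual-padding sketch (field names ours — the talk's point is that @Contended does this for you):

```java
public class PaddedCounters {
    // Without padding, field0 and field1 would usually share one cache
    // line, so two threads incrementing them ping-pong the line between
    // cores: "false sharing".
    int field0;
    // Seven longs (56 bytes) plus the object header and field0 push
    // field1 past the 64-byte line holding field0.
    long p1, p2, p3, p4, p5, p6, p7;
    int field1;

    int incFirst()  { return field0++; }
    int incSecond() { return field1++; }
}
```

In real code prefer the @Contended annotation (JEP 142, jdk.internal.vm.annotation in modern JDKs, enabled for user classes with -XX:-RestrictContended): manual padding fields can be reordered or optimized away by the JVM's field layout.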
  • 82. Example 3: Measure (average time, ns/op)
                             shared between threads   sharing   padded
Same Core (HT)               L1                       9.5       4.9
Diff Cores (within socket)   L3 (LLC)                 10.6      2.8
Diff Sockets                 nothing                  18.2      2.8
55/66
  • 83. Example 3: What do HWC tell us? (all values normalized per 10^3 operations)
                   Same Core          Diff Cores         Diff Sockets
                   sharing  padded    sharing  padded    sharing  padded
CPI                1.3      0.7       1.4      0.4       1.7      0.4
cycles             33130    17536     36012    9163      46484    9608
instructions       26418    25865     26550    25747     26717    25768
L1-loads           12593    9467      9696     8973      9672     9016
L1-load-misses     10       5         12       4         33       3
L1-stores          4317     7838      7433     4069      6935     4074
L1-store-misses    5        2         161      2         55       1
LLC-loads          4        3         58       1         32       1
LLC-load-misses    1        1         53       ≈0        35       ≈0
LLC-stores         1        1         183      ≈0        49       ≈0
LLC-store-misses   1        ≈0        182      ≈0        48       ≈0
56/66
  • 84. Example 3: «on a core and a prayer» — in the single-core (HT) case we have to look at MACHINE_CLEARS.MEMORY_ORDERING
                   Same Core          Diff Cores         Diff Sockets
                   sharing  padded    sharing  padded    sharing  padded
CPI                1.3      0.7       1.4      0.4       1.7      0.4
cycles             33130    17536     36012    9163      46484    9608
instructions       26418    25865     26550    25747     26717    25768
CLEARS             238      ≈0        ≈0       ≈0        ≈0       ≈0
57/66
  • 85. Example 3: Diff Cores — L2_STORE_LOCK_RQSTS (L2 RFOs breakdown)
                               Diff Cores         Diff Sockets
                               sharing  padded    sharing  padded
CPI                            1.4      0.4       1.7      0.4
LLC-stores                     183      ≈0        49       ≈0
LLC-store-misses               182      ≈0        48       ≈0
L2_STORE_LOCK_RQSTS.MISS       134      ≈0        33       ≈0
L2_STORE_LOCK_RQSTS.HIT_E      ≈0       ≈0        ≈0       ≈0
L2_STORE_LOCK_RQSTS.HIT_M      ≈0       ≈0        ≈0       ≈0
L2_STORE_LOCK_RQSTS.ALL        183      ≈0        49       ≈0
58/66
  • 87. Example 3: Diff Cores
                               Diff Cores         Diff Sockets
                               sharing  padded    sharing  padded
CPI                            1.4      0.4       1.7      0.4
LLC-stores                     183      ≈0        49       ≈0
LLC-store-misses               182      ≈0        48       ≈0
L2_STORE_LOCK_RQSTS.MISS       134      ≈0        33       ≈0
L2_STORE_LOCK_RQSTS.ALL        183      ≈0        49       ≈0
Why 183 > 49 and 134 > 33, yet the same-socket case is faster? 60/66
  • 88. Example 3: Some events count duration. For example: OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO
                               Diff Cores         Diff Sockets
                               sharing  padded    sharing  padded
CPI                            1.4      0.4       1.7      0.4
cycles                         36012    9163      46484    9608
instructions                   26550    25747     26717    25768
O_R_O.CYCLES_W_D_RFO           21723    10        29601    56
61/66
  • 90. Summary: "Performance is easy" To achieve high performance: • You have to know your Application! • You have to know your Frameworks! • You have to know your Virtual Machine! • You have to know your Operating System! • You have to know your Hardware! 63/66
  • 91. Enlarge your knowledge with these simple tricks! Reading list: • “Computer Architecture: A Quantitative Approach” John L. Hennessy, David A. Patterson • CPU vendors documentation • http://www.agner.org/optimize/ • http://www.google.com/search?q=Hardware+ performance+counter • etc. . . 64/66
  • 93. Q & A ? 66/66
  • 94. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ pmu-hwc-java