Java Mixed-Mode Flame Graphs
Brendan Gregg, Senior Performance Architect
Oct 2015
Understanding Java CPU usage
quickly and completely
Quickly
•  Via SSH and open source tools (covered in this talk)
•  Or using Netflix Vector GUI (also open source):
1.  Observe high CPU usage
2.  Generate a flame graph
Java Mixed-Mode Flame Graph
via Linux perf_events:
Completely
Java JVM
Kernel
GC
Messy House Fallacy
•  Don't overlook system code: kernel, libraries, etc.
Fallacy: my code is a mess, I bet yours is immaculate, therefore the bug must be mine
Reality: everyone's code is terrible and buggy
Context
•  Over 60 million subscribers
–  Just launched in Spain!
•  AWS EC2 Linux cloud
•  FreeBSD CDN
•  Awesome place to work
Cloud
•  Tens of thousands of AWS EC2 instances
•  Mostly running Java applications (Oracle JVM)
Linux (usually Ubuntu)
Java (JDK 8)
Tomcat
GC and thread dump logging
hystrix, metrics (Servo), health check
Optional Apache, memcached, Node.js, …
Atlas, S3 log rotation, sar, ftrace, perf, stap, perf-tools
Vector, pcp
Application war files, platform, base servlet
Why we need CPU profiling
•  Improving performance
–  Identify tuning targets
–  Incident response
–  Non-regression testing
–  Software evaluations
–  CPU workload characterization
•  Cost savings
–  Auto Scaling Groups (ASGs) often scale on load average (CPU), so CPU usage is proportional to cost
[Figure: an Auto Scaling Group of instances, with a Scaling Policy (loadavg, latency, …) and CloudWatch, Servo]
The Problem with Profilers
Java Profilers
[Figure labels: Java, GC, kernel, libraries, JVM]
Java Profilers
•  Visibility
–  Java method execution
–  Object usage
–  GC logs
–  Custom Java context
•  Typical problems:
–  Sampling often happens at safety/yield points (skew)
–  Method tracing has massive observer effect
–  Misidentifies RUNNING as on-CPU (e.g., epoll)
–  Doesn't include or profile GC or JVM CPU time
–  Tree views are not quick to comprehend (not proportional)
•  Inaccurate (skewed) and incomplete profiles
System Profilers
[Figure labels: Java, kernel, TCP/IP, GC, idle thread, time, locks, epoll, JVM]
System Profilers
•  Visibility
–  JVM (C++)
–  GC (C++)
–  libraries (C)
–  kernel (C)
•  Typical problems (x86):
–  Stacks missing for Java
–  Symbols missing for Java methods
•  Other architectures (e.g., SPARC) have fared better
•  Profile everything except Java
Workaround
•  Capture both Java and system profiles, and examine side by side
•  An improvement, but Java context is often crucial for interpreting system profiles
Java System
Java Mixed-Mode Flame Graph
Solution
Java JVM
Kernel
GC
Solution
•  Fix system profiling
–  Only way to see it all
•  Visibility is everything:
–  Java methods
–  JVM (C++)
–  GC (C++)
–  libraries (C)
–  kernel (C)
•  Minor Problems:
–  0-3% CPU overhead to enable frame pointers (usually <1%).
–  Symbol dumps can consume a burst of CPU
•  Complete and accurate (asynchronous) profiling
Java
JVM
Kernel
GC
Simple Production Example
1.  Poor performance, and one CPU at 100%
2.  perf_events flame graph shows JVM stuck compiling
Another System Example
Exception handling consuming CPU
DEMO
FlameGraph_tomcat01.svg
Exonerating The System
•  From last week:
–  Frequent thread creation/destruction assumed to be consuming CPU resources. Recode application?
–  A flame graph quantified this CPU time: near zero
–  Time mostly other Java methods
Profiling GC
GC internals, visualized:
CPU Profiling
CPU Profiling
•  Record stacks at a timed interval: simple and effective
–  Pros: Low (deterministic) overhead
–  Cons: Coarse accuracy, but usually sufficient
[Figure: timeline of functions A and B on-CPU and off-CPU (syscall, block, interrupt), with stack samples taken at fixed intervals]
Stack Traces
•  A code path snapshot. e.g., from jstack(1):
$ jstack 1819
[…]
"main" prio=10 tid=0x00007ff304009000
nid=0x7361 runnable [0x00007ff30d4f9000]
java.lang.Thread.State: RUNNABLE
at Func_abc.func_c(Func_abc.java:6)
at Func_abc.func_b(Func_abc.java:16)
at Func_abc.func_a(Func_abc.java:23)
at Func_abc.main(Func_abc.java:27)
running
parent
g.parent
g.g.paren
running
codepath
start
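As a rough illustration of how a stack trace describes one code path, the sketch below parses the "at ..." frame lines from jstack-style text and joins them root-first with semicolons. The function name jstack_to_folded is hypothetical, and the parsing assumes only the simple frame format shown in the jstack example on this slide.

```python
import re

# Matches frame lines such as "at Func_abc.func_c(Func_abc.java:6)".
FRAME_RE = re.compile(r"^\s*at\s+([\w.$<>]+)\(")

def jstack_to_folded(thread_dump_text):
    """Return one thread's frames as a root-first, semicolon-joined code path."""
    frames = []
    for line in thread_dump_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    # jstack prints the running method first; reverse so the path reads start -> running.
    frames.reverse()
    return ";".join(frames)

sample = """\
"main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000]
   java.lang.Thread.State: RUNNABLE
        at Func_abc.func_c(Func_abc.java:6)
        at Func_abc.func_b(Func_abc.java:16)
        at Func_abc.func_a(Func_abc.java:23)
        at Func_abc.main(Func_abc.java:27)
"""
print(jstack_to_folded(sample))
```

The reversed order matches the "start to running" reading of the code path; the top-down order that jstack prints is the "running, parent, grandparent" reading.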
System Profilers
•  Linux
–  perf_events (aka "perf")
•  Oracle Solaris
–  DTrace
•  OS X
–  Instruments
•  Windows
–  XPerf
•  And many others…
Linux perf_events
•  Standard Linux profiler
–  Provides the perf command (multi-tool)
–  Usually added by a package such as linux-tools-common
•  Features:
–  Timer-based sampling
–  Hardware events
–  Tracepoints
–  Dynamic tracing
•  Can sample stacks of (almost) everything on CPU
–  Can miss hard interrupt ISRs, but these should be near-zero. They can
be measured if needed (I wrote my own tools)
perf record Profiling
•  Stack profiling on all CPUs at 99 Hertz, then dump:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf script
[…]
bash 13204 cpu-clock:
459c4c dequote_string (/root/bash-4.3/bash)
465c80 glob_expand_word_list (/root/bash-4.3/bash)
466569 expand_word_list_internal (/root/bash-4.3/bash)
465a13 expand_words (/root/bash-4.3/bash)
43bbf7 execute_simple_command (/root/bash-4.3/bash)
435f16 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
43a771 execute_while_or_until (/root/bash-4.3/bash)
43a636 execute_while_command (/root/bash-4.3/bash)
436129 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
420cd5 reader_loop (/root/bash-4.3/bash)
41ea58 main (/root/bash-4.3/bash)
7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[… ~47,000 lines truncated …]
one
stack
sample
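The slide does not say why the rate is 99 Hertz rather than 100. A commonly given reason is to avoid sampling in lockstep with periodic activity. The sketch below is a small simulation under stated assumptions: a hypothetical task that runs for 1 ms at the start of every 10 ms, sampled for 10 seconds. The function and parameter names are made up for illustration.

```python
from fractions import Fraction

def sampled_on_cpu_fraction(sample_hz, seconds=10, task_period_ms=10,
                            task_busy_ms=1, offset_ms=0):
    """Fraction of timed samples that land while a periodic task is on-CPU.

    The task (hypothetical) runs for task_busy_ms at the start of every
    task_period_ms, so its true CPU usage is task_busy_ms / task_period_ms.
    """
    interval_ms = Fraction(1000, sample_hz)  # exact time between samples
    total = sample_hz * seconds
    hits = 0
    for i in range(total):
        t = offset_ms + i * interval_ms
        if t % task_period_ms < task_busy_ms:
            hits += 1
    return Fraction(hits, total)

print(float(sampled_on_cpu_fraction(100)))               # lockstep: every sample hits
print(float(sampled_on_cpu_fraction(100, offset_ms=5)))  # lockstep: every sample misses
print(float(sampled_on_cpu_fraction(99)))                # close to the true 10%
```

At 100 Hertz the answer depends entirely on where the first sample happens to fall; at 99 Hertz the samples drift across the task's period and the estimate lands near the real usage.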
perf report Summary
•  Generates a call tree and combines samples:
# perf report -n --stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
call tree
summary
Flame Graphs
perf report Verbosity
•  Despite summarizing, output is still verbose
# perf report -n --stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
Full perf report Output
… as a Flame Graph
Flame Graphs
•  Flame Graphs:
–  x-axis: alphabetical stack sort, to maximize merging
–  y-axis: stack depth
–  color: random (default), or a dimension
•  Currently made from Perl + SVG + JavaScript
–  Multiple d3 versions are being developed
•  Easy to get working
–  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
–  The commands below are for Linux; see the URL for other OSes
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
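To make the middle step of that pipeline less opaque, here is a minimal sketch of what "collapsing" stacks means: each sample becomes one semicolon-joined line, root first, with a count of how often it occurred. This is a simplified stand-in written for illustration, not the actual logic of stackcollapse-perf.pl. It assumes samples are separated by blank lines, that the first line of each sample starts with the command name, and that frame lines look like "address symbol (dso)". The sample text is abbreviated from the bash example earlier in the talk.

```python
from collections import Counter

def fold_stacks(perf_script_text):
    """Count identical stacks, written as 'comm;root;...;leaf'."""
    counts = Counter()
    for block in perf_script_text.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        comm = lines[0].split()[0]           # e.g. "bash" from "bash 13204 cpu-clock:"
        frames = []
        for ln in lines[1:]:
            parts = ln.split(None, 1)        # address, then "symbol (dso)"
            rest = parts[1] if len(parts) > 1 else "[unknown]"
            frames.append(rest.rsplit(" (", 1)[0])  # drop the trailing "(dso)"
        frames.reverse()                     # perf prints the leaf first
        counts[";".join([comm] + frames)] += 1
    return counts

sample = """\
bash 13204 cpu-clock:
        43bbf7 execute_simple_command (/root/bash-4.3/bash)
        41ea58 main (/root/bash-4.3/bash)

bash 13204 cpu-clock:
        43bbf7 execute_simple_command (/root/bash-4.3/bash)
        41ea58 main (/root/bash-4.3/bash)

bash 13204 cpu-clock:
        41ea58 main (/root/bash-4.3/bash)
"""
for stack, count in sorted(fold_stacks(sample).items()):
    print(stack, count)
```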
Linux perf_events Workflow
Typical workflow:
perf list (list events)
perf stat (count events)
perf record (capture stacks) → perf.data
perf report (text UI)
perf script (dump profile) → stackcollapse-perf.pl → flamegraph.pl (flame graph visualization)
Flame Graph Interpretation
[Figure: example flame graph with frames a() through i()]
Flame Graph Interpretation (1/3)
Top edge shows who is running on-CPU, and how much (width)
[Figure: example flame graph with frames a() through i()]
Flame Graph Interpretation (2/3)
Top-down shows ancestry, e.g., from g()
[Figure: example flame graph with frames a() through i()]
Flame Graph Interpretation (3/3)
Widths are proportional to presence in samples, e.g., comparing b() to h() (incl. children)
[Figure: example flame graph with frames a() through i()]
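The three reading rules can be checked with a little arithmetic on folded stacks. The sketch below computes, for every rectangle (a stack prefix), how many samples pass through it (its width) and how many samples end there (its share of the top edge). The call structure and the sample counts are assumptions invented for the example; only the frame names a() through i() come from the slides.

```python
from collections import Counter

def frame_widths(folded):
    """Return (width, on_cpu) counters keyed by stack prefix, one key per rectangle."""
    width = Counter()
    on_cpu = Counter()
    for stack, count in folded.items():
        frames = tuple(stack.split(";"))
        for depth in range(1, len(frames) + 1):
            width[frames[:depth]] += count   # every ancestor is present in this sample
        on_cpu[frames] += count              # only the leaf was actually running
    return width, on_cpu

# Hypothetical sample counts for an assumed call structure.
folded = {
    "a;b;c;d;e": 3,
    "a;b;c;d;f;g": 2,
    "a;b;c": 1,
    "a;h;i": 4,
}
width, on_cpu = frame_widths(folded)
print(width[("a", "b")], width[("a", "h")])   # comparing b() to h(), incl. children
print(on_cpu[("a", "b", "c")])                # samples where c() itself was on-CPU
```

Keying by prefix rather than by function name keeps the same function under different ancestors as separate rectangles, which is how it would be drawn.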
Flame Graph Colors
•  Randomized by default
•  Can be used as a dimension. e.g.:
–  Mixed-mode flame graphs
–  Differential flame graphs
–  Search
Mixed-Mode Flame Graphs
•  Hues:
–  green == Java
–  red == system
–  yellow == C++
•  Intensity randomized to differentiate frames
–  Or hashed based on function name
[Figure: mixed-mode flame graph with Java, JVM, and kernel regions labeled]
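A small sketch of the two ideas on this slide: choose the hue from the kind of frame, and derive the intensity from a hash of the function name so the same function gets the same shade in every graph. The classification rules here are simplified assumptions made for illustration (a "/" suggests a Java package path, "::" suggests C++); they are not the rules flamegraph.pl actually applies.

```python
import zlib

def frame_hue(name):
    """Pick a hue using simplified, assumed rules."""
    if "/" in name:
        return "green"    # looks like a Java package path: treat as Java
    if "::" in name:
        return "yellow"   # looks like a C++ scope: treat as C++ (JVM)
    return "red"          # everything else: system (kernel, libraries)

def frame_intensity(name):
    """Deterministic value in [0, 1) derived from the function name."""
    return (zlib.crc32(name.encode("utf-8")) % 1000) / 1000.0

for name in ["java/lang/Thread.run", "JavaThread::run()", "tcp_sendmsg"]:
    print(name, frame_hue(name), frame_intensity(name))
```

Hashing rather than randomizing matters when two graphs are compared side by side, since a frame keeps its shade between them.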
Differential Flame Graphs
•  Hues:
–  red == more samples
–  blue == fewer samples
•  Intensity shows the degree of difference
•  Used for comparing two profiles
•  Also used for showing other metrics: e.g., CPI
[Figure: differential flame graph with a color legend from more to less]
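The comparison behind a differential flame graph can be sketched as a per-stack subtraction of two folded profiles. The function name and the sample counts below are hypothetical, and the sketch assumes both profiles were collected for the same duration and rate; if not, the counts would need normalizing first.

```python
def diff_folded(before, after):
    """Per-stack sample delta (after - before) with a hue: red = more, blue = fewer."""
    result = {}
    for stack in set(before) | set(after):
        delta = after.get(stack, 0) - before.get(stack, 0)
        if delta > 0:
            hue = "red"
        elif delta < 0:
            hue = "blue"
        else:
            hue = "none"
        result[stack] = (delta, hue)
    return result

# Hypothetical folded profiles taken before and after a change.
before = {"a;b;c": 10, "a;h": 5}
after = {"a;b;c": 4, "a;h": 5, "a;h;i": 7}
for stack, (delta, hue) in sorted(diff_folded(before, after).items()):
    print(stack, delta, hue)
```

The size of the delta is what would drive the intensity, so a frame that gained one sample would be a much paler red than one that gained hundreds.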
Flame Graph Search
•  Color: magenta to show matched frames
[Figure: flame graph showing the search button]
Flame Charts
•  Flame charts: x-axis is time
•  Flame graphs: x-axis is population (maximize merging)
•  Final note: these are useful, but are not flame graphs
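The difference between the two x-axes can be shown with a few lines. In this sketch (hypothetical samples and function name), a flame chart keeps samples in time order, so only adjacent identical stacks merge; a flame graph sorts first, so all identical stacks merge into one wide frame.

```python
from itertools import groupby

def merge_adjacent(stacks):
    """Collapse runs of identical adjacent stacks into (stack, width) pairs."""
    return [(stack, len(list(run))) for stack, run in groupby(stacks)]

# Hypothetical samples in the order they were taken.
samples = ["a;b", "a;c", "a;b", "a;b", "a;c"]

flame_chart = merge_adjacent(samples)          # x-axis is time: order preserved
flame_graph = merge_adjacent(sorted(samples))  # x-axis is sorted: identical stacks merge

print(flame_chart)
print(flame_graph)
```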
Stack Tracing
System Profiling Java on x86
•  For example, using Linux perf
•  The stacks are 1 or 2 levels deep, and have junk values
# perf record -F 99 -a -g -- sleep 30
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
… as a Flame Graph
Broken Java stacks
(missing frame pointer)
Why Stacks are Broken
•  On x86 (x86_64), hotspot uses the frame pointer register (RBP) as a general-purpose register
•  This "compiler optimization" breaks (simple) stack walking
•  Once upon a time, x86 had fewer registers, and this made much more sense
•  gcc provides -fno-omit-frame-pointer to avoid doing this, but the JVM had no such option…
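A toy model may help show why reusing RBP breaks the walk. With the conventional x86_64 prologue (push rbp; mov rbp, rsp), the caller's saved RBP sits at [rbp] and the return address at [rbp+8], so a profiler can hop from frame to frame. The sketch below simulates that with a dictionary standing in for stack memory. All addresses are hypothetical, and real walkers validate pointers more carefully than this.

```python
def walk_frame_pointers(memory, rip, rbp, max_depth=64):
    """Follow the saved-RBP chain and return the list of code addresses found."""
    addresses = [rip]
    while len(addresses) < max_depth and rbp in memory and rbp + 8 in memory:
        addresses.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]                  # caller's saved frame pointer
    return addresses

# Hypothetical stack memory: address -> 8-byte value.
memory = {
    0x7000: 0x7100, 0x7008: 0x401000,   # frame of the running function
    0x7100: 0x7200, 0x7108: 0x402000,   # its parent
    0x7200: 0x0,    0x7208: 0x403000,   # grandparent; a saved RBP of 0 ends the chain
}

intact = walk_frame_pointers(memory, rip=0x400500, rbp=0x7000)
# If RBP is used as a general-purpose register it holds an arbitrary value,
# so the walk finds nothing to follow and stops after the first address.
broken = walk_frame_pointers(memory, rip=0x400500, rbp=0xa2d53351ff7da603)

print([hex(a) for a in intact])
print([hex(a) for a in broken])
```

The broken case corresponds to the 1 or 2 level stacks with junk values seen in the perf script output.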
Fixing Stack Walking
Possibilities:
A.  Fix frame pointer-based stack walking (the default)
–  Pros: simple, supported by many tools
–  Cons: might cost a little extra CPU
B.  Use a custom walker (likely needing kernel support)
–  Pros: full stack walking (incl. inlining) & arguments
–  Cons: custom kernel code, can cost more CPU when in use
C.  Try libunwind and DWARF
–  Even feasible with JIT?
Our current preference is (A)
Hacking OpenJDK (1/2)
•  As a proof of concept, I hacked hotspot to support an
x86_64 frame pointer
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 …
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 …
@@ -166,10 +166,9 @@
// 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
//
-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
reg_class any_reg(RAX, RAX_H,
RDX, RDX_H,
- RBP, RBP_H,
RDI, RDI_H,
RSI, RSI_H,
RCX, RCX_H,
[...]
Remove RBP from
register pools
Hacking OpenJDK (2/2)
•  We used this patched version successfully for some limited
(and urgent) performance analysis
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]
Fix x86_64 function
prologues
-XX:+PreserveFramePointer
•  We shared our patch publicly
–  See "A hotspot patch for stack profiling (frame pointer)" on the hotspot compiler dev mailing list
–  It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8, and the -XX:+PreserveFramePointer option
•  Zoltán Majó (Oracle) took this on, rewrote it, and it is now:
–  In JDK 9
–  In JDK 8 update 60 build 19
–  Thanks to Zoltán, Oracle, and the other hotspot engineers for
helping get this done!
•  It might cost 0 – 3% CPU, depending on workload
Broken Java Stacks (before)
•  Check with "perf script" to see stack samples
•  These are 1 or 2 levels deep (junk values)
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f417908c194 [unknown] (/tmp/perf-4458.map)
[…]
Fixed Java Stacks
•  With -XX:+PreserveFramePointer stacks are full, and go all the way to start_thread()
•  This is what the CPUs are really running: inlined frames are not present
# perf script
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
7fd30184def8 [unknown] (/tmp/perf-8131.map)
7fd30174f544 [unknown] (/tmp/perf-8131.map)
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
7fd30166d51c [unknown] (/tmp/perf-8131.map)
7fd301750f34 [unknown] (/tmp/perf-8131.map)
7fd3016c2280 [unknown] (/tmp/perf-8131.map)
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
7fd3016f9888 [unknown] (/tmp/perf-8131.map)
7fd3016ece04 [unknown] (/tmp/perf-8131.map)
7fd30177783c [unknown] (/tmp/perf-8131.map)
7fd301600aa8 [unknown] (/tmp/perf-8131.map)
7fd301a4484c [unknown] (/tmp/perf-8131.map)
7fd3010072e0 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd3010004e7 [unknown] (/tmp/perf-8131.map)
7fd3171df76a JavaCalls::call_helper(JavaValue*,…
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
7fd31721b6ce thread_entry(JavaThread*, Thread*)…
7fd3175389e0 JavaThread::thread_main_inner() (/…
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
Fixed Stacks Flame Graph
Java stacks
(but no symbols)
Stacks & Inlining
•  Frames may be missing (inlined)
•  Disabling inlining:
–  -XX:-Inline
–  Many more Java frames
–  Can be 80% slower!
•  May not be necessary
–  Inlined flame graphs often make enough sense
–  Or tune -XX:MaxInlineSize and -XX:InlineSmallCode a little to reveal more frames
•  Can even improve performance!
•  perf-map-agent (next) has experimental un-inline support
No inlining
Symbols
Missing Symbols
•  Missing symbols may show up as hex; e.g., Linux perf:
Not broken:
12.06% 62 sed sed [.] re_search_internal
|
--- re_search_internal
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
Broken:
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
Fixing Symbols
•  For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map, and
warns if it doesn't exist
•  This file can be created by a Java agent
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
Java Symbols for perf
•  perf-map-agent
–  https://github.com/jrudolph/perf-map-agent
–  Agent attaches and writes the /tmp file on demand (previous
versions attached on Java start, wrote continually)
–  Thanks Johannes Rudolph!
•  Use of a /tmp symbol file
–  Pros: simple, can be low overhead (snapshot on demand)
–  Cons: stale symbols
•  Using a symbol logger with perf instead
–  Patch by Stephane Eranian currently being discussed on
lkml; see "perf: add support for profiling jitted code"
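For a sense of what the /tmp/perf-PID.map file provides, here is a minimal sketch of reading one and resolving an address. Each line of the map is "START SIZE name", with START and SIZE in hexadecimal. The symbol names in the sample are made up for illustration, and the helper names parse_perf_map and resolve are hypothetical.

```python
import bisect

def parse_perf_map(text):
    """Parse 'START SIZE name' lines (hex START and SIZE) into a sorted list."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            start, size, name = parts
            entries.append((int(start, 16), int(size, 16), name))
    entries.sort()
    return entries

def resolve(entries, address):
    """Return the symbol covering address, or '[unknown]' if none does."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0:
        start, size, name = entries[i]
        if address < start + size:
            return name
    return "[unknown]"

# Hypothetical map contents; the addresses are chosen to cover two samples from the slides.
sample_map = """\
7fd301861e00 80 Lcom/example/Foo;::bar
7fd30184de00 200 Lcom/example/Foo;::baz
"""
entries = parse_perf_map(sample_map)
print(resolve(entries, 0x7fd301861e46))
print(resolve(entries, 0x7fd30184def8))
print(resolve(entries, 0x7fd3010004e7))
```

Because JIT-compiled code can be recompiled or moved, a map written long after the profile may name the wrong method for an address, which is the stale symbol problem noted above.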
Java Mixed-Mode Flame Graph
Stacks & Symbols
Java JVM
Kernel
GC
Stacks & Symbols (zoom)
Instructions
Instructions
1.  Check Java version
2.  Install perf-map-agent
3.  Set -XX:+PreserveFramePointer
4.  Profile Java
5.  Dump symbols
6.  Generate Mixed-Mode Flame Graph
Note these are unsupported: use at your own risk
Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
1. Check Java Version
•  Need JDK8u60 or better
–  for -XX:+PreserveFramePointer
•  Upgrade Java if necessary
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
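If this check needs to be scripted across many instances, something like the sketch below could parse the version text. It is a rough sketch under assumptions: the function name is hypothetical, the parsing covers only the "1.8.0_60" style and the plain "9" or "11.0.2" style, and it does not inspect build numbers. Note that java -version writes to standard error, so a caller would need to capture that stream.

```python
import re

def supports_preserve_frame_pointer(version_output):
    """True if `java -version` text reports JDK 8u60 or later (assumed parsing rules)."""
    match = re.search(r'version "([^"]+)"', version_output)
    if not match:
        return False
    version = match.group(1)
    legacy = re.match(r"1\.(\d+)\.\d+(?:_(\d+))?", version)
    if legacy:  # e.g. "1.8.0_60": feature release 8, update 60
        feature = int(legacy.group(1))
        update = int(legacy.group(2) or 0)
        return (feature, update) >= (8, 60)
    modern = re.match(r"(\d+)", version)  # e.g. "9" or "11.0.2"
    return bool(modern) and int(modern.group(1)) >= 9

print(supports_preserve_frame_pointer('java version "1.8.0_60"'))
print(supports_preserve_frame_pointer('java version "1.8.0_45"'))
```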
2. Install perf-map-agent
•  Check https://github.com/jrudolph/perf-map-agent for the latest instructions. e.g.:
$ sudo bash
# apt-get install -y cmake
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# cd /usr/lib/jvm
# git clone --depth=1 https://github.com/jrudolph/perf-map-agent
# cd perf-map-agent
# cmake .
# make
3. Set -XX:+PreserveFramePointer
•  Needs to be set on Java startup
•  Check it is enabled (on Linux):
$ ps wwp `pgrep -n java` | grep PreserveFramePointer
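The same check can be done programmatically on the process arguments, which Linux exposes as NUL-separated bytes in /proc/PID/cmdline. The sketch below is illustrative only: the function name is hypothetical, and it assumes that when the flag is given more than once the last occurrence takes effect. It also cannot see options supplied some other way, such as through an environment variable.

```python
def preserve_frame_pointer_enabled(cmdline):
    """Check NUL-separated JVM arguments (as in /proc/PID/cmdline) for the flag."""
    enabled = False
    for arg in cmdline.split(b"\0"):
        if arg == b"-XX:+PreserveFramePointer":
            enabled = True
        elif arg == b"-XX:-PreserveFramePointer":
            enabled = False  # assumed: a later -XX:- overrides an earlier -XX:+
    return enabled

# Hypothetical argument list, in the format of /proc/PID/cmdline.
example = b"java\0-XX:+PreserveFramePointer\0-jar\0app.jar\0"
print(preserve_frame_pointer_enabled(example))
```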
4. Profile Java
•  Using Linux perf_events to profile all processes, at 99 Hertz, for 30 seconds (as root):
# perf record -F 99 -a -g -- sleep 30
•  Just profile one PID (broken on some older kernels):
# perf record -F 99 -p PID -g -- sleep 30
•  These create a perf.data file
5. Dump Symbols
•  See perf-map-agent docs for updated usage
•  e.g., as the same user as java:
$ cd /usr/lib/jvm/perf-map-agent/out
$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
    net.virtualvoid.perf.AttachOnce PID
•  perf-map-agent contains helper scripts. I wrote my own:
–  https://github.com/brendangregg/Misc/blob/master/java/jmaps
•  Dump symbols quickly after perf record to minimize stale symbols. How I do it:
# perf record -F 99 -a -g -- sleep 30; jmaps
6. Generate a Mixed-Mode Flame Graph
•  Using my FlameGraph software:
# perf script > out.stacks01
# git clone --depth=1 https://github.com/brendangregg/FlameGraph
# cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg
–  perf script reads perf.data with /tmp/*.map
–  out.stacks01 is an intermediate file; can be handy to keep
•  Finally open flame01.svg in a browser
•  Check for newer flame graph implementations (e.g., d3)
Automation
Netflix Vector
Netflix Vector
[Screenshot: Vector UI with instance selection, metric selection, near real-time per-second metrics, and flame graphs]
Netflix Vector
•  Open source, on-demand, instance analysis tool
–  https://github.com/netflix/vector
•  Shows various real-time metrics
•  Flame graph support currently in development
–  Automating previous steps
–  Using it internally already
–  Also developing a new d3 front end
DEMO
d3-flame-graph
Advanced Analysis
Linux perf_events Coverage
… all possible with Java stacks
Advanced Flame Graphs
•  Examples:
–  Page faults
–  Context switches
–  Disk I/O requests
–  TCP events
–  CPU cache misses
–  CPI
•  Any event issued in synchronous Java context
Synchronous Java Context
•  Java thread still on-CPU, and event is directly triggered
•  Examples:
–  Disk I/O requests issued directly by Java → yes
•  direct reads, sync writes, page faults
–  Disk I/O completion interrupts → no*
–  Disk I/O requests triggered async, e.g., readahead → no*
* can be made yes by tracing and associating context
Page Faults
•  Show what triggered main memory (resident) to grow:
# perf record -e page-faults -p PID -g -- sleep 120
•  "fault" as (physical) main memory is allocated on-demand, when a virtual page is first populated
•  Low overhead tool to solve some types of memory leak
[Figure annotation: the flame graph shows why the RES column in top(1) grows]
Page Fault Flame Graph
GC
Java code
epoll
Context Switches
•  Show why Java blocked and stopped running on-CPU:
# perf record -e context-switches -p PID -g -- sleep 5
•  Identifies locks, I/O, sleeps
–  If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load).
•  e.g., was used to understand framework differences: rxNetty vs Tomcat
Context Switch Flame Graph (1/2)
rxNetty
epoll futex
Context Switch Flame Graph (2/2)
Tomcat sys_poll
futex
Disk I/O Requests
•  Shows who issued disk I/O (sync reads & writes):
# perf record -e block:block_rq_insert -a -g -- sleep 60
•  e.g.: page faults in GC? This JVM has swapped out!
[Figure: disk I/O flame graph, with GC labeled]
TCP Events
•  TCP transmit, using dynamic tracing:
# perf probe tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks
# perf probe --del tcp_sendmsg
•  Note: can be high overhead for high packet rates
–  For the current perf trace, dump, post-process cycle
•  Can also trace TCP connect & accept (lower overhead)
•  TCP receive is async
–  Could trace via socket read
TCP Send Flame Graph
kernel
Java
JVM
Only one code-path
taken in this example
ab (client process)
CPU Cache Misses
•  In this example, sampling via Last Level Cache loads:
# perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks
•  -c is the count (samples once per count)
•  Use other CPU counters to sample hits, misses, stalls
One Last Example
•  Back to a mixed-mode CPU flame graph
•  What else can we show with color?
CPI Flame Graph
•  Cycles Per Instruction!
–  red == instruction heavy
–  blue == cycle heavy (likely mem stall cycles)
[Figure: CPI flame graph, zoomed]
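The arithmetic behind a CPI coloring can be sketched as follows. It assumes two per-frame event totals are already available, one for cycles and one for instructions (for example, sample counts multiplied by the sampling count). The frame names, numbers, and function names are hypothetical, and comparing each frame to the overall CPI is just one possible choice of baseline.

```python
def cpi_by_frame(cycles, instructions):
    """Cycles per instruction for each frame that has instruction counts."""
    return {
        frame: cycles[frame] / instructions[frame]
        for frame in cycles
        if instructions.get(frame, 0) > 0
    }

def cpi_hue(cpi, baseline):
    """Slide's convention: blue == cycle heavy (high CPI), red == instruction heavy."""
    return "blue" if cpi > baseline else "red"

# Hypothetical per-frame event totals.
cycles = {"copy_buffer": 9000, "hash_key": 2000}
instructions = {"copy_buffer": 3000, "hash_key": 4000}

cpi = cpi_by_frame(cycles, instructions)
baseline = sum(cycles.values()) / sum(instructions.values())
for frame in sorted(cpi):
    print(frame, cpi[frame], cpi_hue(cpi[frame], baseline))
```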
Links & References
•  Flame Graphs
–  http://www.brendangregg.com/flamegraphs.html
–  http://techblog.netflix.com/2015/07/java-in-flames.html
–  http://techblog.netflix.com/2014/11/nodejs-in-flames.html
–  http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
•  Linux perf_events
–  https://perf.wiki.kernel.org/index.php/Main_Page
–  http://www.brendangregg.com/perf.html
–  http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
•  Netflix Vector
–  https://github.com/netflix/vector
–  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
•  JDK tickets
–  JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465
–  JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945
•  hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
Thanks
•  Questions?
•  http://techblog.netflix.com
•  http://slideshare.net/brendangregg
•  http://www.brendangregg.com
•  bgregg@netflix.com
•  @brendangregg

JavaOne 2015 Java Mixed-Mode Flame Graphs

  • 1. Java Mixed-Mode Flame Graphs Brendan Gregg Senior Performance Architect Oct  2015  
  • 2. Understanding Java CPU usage quickly and completely
  • 3. Quickly •  Via SSH and open source tools (covered in this talk) •  Or using Netflix Vector GUI (also open source): 1.  Observe high CPU usage 2.  Generate a flame graph
  • 4. Java Mixed-Mode Flame Graph via Linux perf_events: Completely Java JVM Kernel GC
  • 5. Messy House Fallacy •  Don't overlook system code: kernel, libraries, etc. Fallacy:  my  code  is  a  mess,  I  bet  yours  is   immaculate,  therefore  the  bug  must  be  mine     Reality:  everyone's  code  is  terrible  and  buggy  
  • 7. •  Over 60 million subscribers –  Just launched in Spain! •  AWS EC2 Linux cloud •  FreeBSD CDN •  Awesome place to work
  • 8. Cloud •  Tens of thousands of AWS EC2 instances •  Mostly running Java applications (Oracle JVM) [Instance stack diagram: Linux (usually Ubuntu); Java (JDK 8); Tomcat; GC and thread dump logging; hystrix, metrics (Servo), health check; Optional Apache, memcached, Node.js, …; Atlas, S3 log rotation, sar, ftrace, perf, stap, perf-tools; Vector, pcp; Application war files, platform, base servlet]
  • 9. Why we need CPU profiling •  Improving performance –  Identify tuning targets –  Incident response –  Non-regression testing –  Software evaluations –  CPU workload characterization •  Cost savings –  ASGs often scale on load average (CPU), so CPU usage is proportional to cost [Diagram: Auto Scaling Group of Instances; Scaling Policy (loadavg, latency, …); CloudWatch, Servo]
  • 10. The Problem with Profilers
  • 12. Java Profilers •  Visibility –  Java method execution –  Object usage –  GC logs –  Custom Java context •  Typical problems: –  Sampling often happens at safety/yield points (skew) –  Method tracing has massive observer effect –  Misidentifies RUNNING as on-CPU (e.g., epoll) –  Doesn't include or profile GC or JVM CPU time –  Tree views not quick (proportional) to comprehend •  Inaccurate (skewed) and incomplete profiles
  • 14. System Profilers •  Visibility –  JVM (C++) –  GC (C++) –  libraries (C) –  kernel (C) •  Typical problems (x86): –  Stacks missing for Java –  Symbols missing for Java methods •  Other architectures (e.g., SPARC) have fared better •  Profile everything except Java
  • 15. Workaround •  Capture both Java and system profiles, and examine side by side •  An improvement, but Java context is often crucial for interpreting system profiles Java System
  • 16. Java Mixed-Mode Flame Graph Solution Java JVM Kernel GC
  • 17. Solution •  Fix system profiling –  Only way to see it all •  Visibility is everything: –  Java methods –  JVM (C++) –  GC (C++) –  libraries (C) –  kernel (C) •  Minor Problems: –  0-3% CPU overhead to enable frame pointers (usually <1%). –  Symbol dumps can consume a burst of CPU •  Complete and accurate (asynchronous) profiling Java JVM Kernel GC
  • 18. Simple Production Example 1.  Poor performance, and one CPU at 100% 2.  perf_events flame graph shows JVM stuck compiling
  • 19. Another System Example Exception handling consuming CPU
  • 21. Exonerating The System •  From last week: –  Frequent thread creation/destruction assumed to be consuming CPU resources. Recode application? –  A flame graph quantified this CPU time: near zero –  Time mostly other Java methods
  • 24. CPU Profiling A B block interrupt on-CPU off-CPU A B A A B A syscall time •  Record stacks at a timed interval: simple and effective –  Pros: Low (deterministic) overhead –  Cons: Coarse accuracy, but usually sufficient stack samples: A
  • 25. Stack Traces •  A code path snapshot. e.g., from jstack(1): $ jstack 1819 […] "main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000] java.lang.Thread.State: RUNNABLE at Func_abc.func_c(Func_abc.java:6) at Func_abc.func_b(Func_abc.java:16) at Func_abc.func_a(Func_abc.java:23) at Func_abc.main(Func_abc.java:27) running parent g.parent g.g.paren running codepath start
  • 26. System Profilers •  Linux –  perf_events (aka "perf") •  Oracle Solaris –  DTrace •  OS X –  Instruments •  Windows –  XPerf •  And many others…
  • 27. Linux perf_events •  Standard Linux profiler –  Provides the perf command (multi-tool) –  Usually pkg added by linux-tools-common, etc. •  Features: –  Timer-based sampling –  Hardware events –  Tracepoints –  Dynamic tracing •  Can sample stacks of (almost) everything on CPU –  Can miss hard interrupt ISRs, but these should be near-zero. They can be measured if needed (I wrote my own tools)
  • 28. perf record Profiling •  Stack profiling on all CPUs at 99 Hertz, then dump: # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf script […] bash 13204 cpu-clock: 459c4c dequote_string (/root/bash-4.3/bash) 465c80 glob_expand_word_list (/root/bash-4.3/bash) 466569 expand_word_list_internal (/root/bash-4.3/bash) 465a13 expand_words (/root/bash-4.3/bash) 43bbf7 execute_simple_command (/root/bash-4.3/bash) 435f16 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 43a771 execute_while_or_until (/root/bash-4.3/bash) 43a636 execute_while_command (/root/bash-4.3/bash) 436129 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 420cd5 reader_loop (/root/bash-4.3/bash) 41ea58 main (/root/bash-4.3/bash) 7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so) [… ~47,000 lines truncated …] one stack sample
  • 29. perf report Summary •  Generates a call tree and combines samples: # perf report -n --stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …] call tree summary
  • 31. perf report Verbosity •  Despite summarizing, output is still verbose # perf report -n --stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …]
  • 33. … as a Flame Graph
  • 34. Flame Graphs •  Flame Graphs: –  x-axis: alphabetical stack sort, to maximize merging –  y-axis: stack depth –  color: random (default), or a dimension •  Currently made from Perl + SVG + JavaScript –  Multiple d3 versions are being developed •  Easy to get working –  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html –  Above commands are Linux; see URL for other OSes git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a -g -- sleep 30 perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
  • 35. Linux perf_events Workflow •  perf list (list events) •  perf stat (count events) •  Typical Workflow: perf record (capture stacks) → perf.data → perf report (text UI), or perf script (dump profile) → stackcollapse-perf.pl → flamegraph.pl (flame graph visualization)
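The merging that the x-axis sort enables can be illustrated with a small shell sketch. It assumes a simplified, hypothetical input of one stack per line (root frame first, frames separated by `;`), not raw `perf script` output; the real conversion is done by stackcollapse-perf.pl. The function name `fold_stacks` is made up for this example.

```shell
#!/bin/sh
# fold_stacks: read one stack per line (root first, frames separated by ';')
# and print each distinct stack once, followed by its sample count.
# Sorting places identical stacks next to each other so uniq -c can merge them.
fold_stacks() {
    LC_ALL=C sort | uniq -c |
        awk '{ count = $1; $1 = ""; sub(/^ /, ""); print $0, count }'
}

# Three samples, two of which share the same code path.
printf 'main;a;b\nmain;a;c\nmain;a;b\n' | fold_stacks
```

Each output line is a "stack count" pair, which is the kind of folded input a flame graph is drawn from: identical stacks collapse into one wider rectangle rather than many narrow ones.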
  • 36. Flame Graph Interpretation a() b() h() c() d() e() f() g() i()
  • 37. Flame Graph Interpretation (1/3) Top edge shows who is running on-CPU, and how much (width) a() b() h() c() d() e() f() g() i()
  • 38. Flame Graph Interpretation (2/3) Top-down shows ancestry e.g., from g(): h() d() e() i() a() b() c() f() g()
  • 39. Flame Graph Interpretation (3/3) a() b() h() c() d() e() f() g() i() Widths are proportional to presence in samples e.g., comparing b() to h() (incl. children)
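To make "widths are proportional to presence in samples" concrete, here is a rough sketch that computes the share of samples in which a frame appears anywhere in the stack, which is what a frame's width (including children) represents. It assumes folded input lines of the form "stack count"; the function name `frame_share` and the sample file are hypothetical.

```shell
#!/bin/sh
# frame_share FRAME FILE: print the percentage of samples whose stack contains
# FRAME. Assumes each line of FILE is "frame;frame;... count".
frame_share() {
    awk -v frame="$1" '
        {
            count = $NF
            stack = $0
            sub(/ [0-9]+$/, "", stack)          # strip the trailing count
            total += count
            n = split(stack, frames, ";")
            for (i = 1; i <= n; i++)
                if (frames[i] == frame) { hits += count; break }
        }
        END { if (total > 0) printf "%.1f\n", 100 * hits / total }
    ' "$2"
}

# Example: compare b() to h(), as on the slide.
tmp=$(mktemp)
printf 'a;b;c 3\na;b;d 1\na;h;i 4\n' > "$tmp"
frame_share b "$tmp"
frame_share h "$tmp"
rm -f "$tmp"
```

In this made-up profile, b() and h() would be drawn with equal widths because each is present in half of the samples.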
  • 40. Flame Graph Colors •  Randomized by default •  Can be used as a dimension. e.g.: –  Mixed-mode flame graphs –  Differential flame graphs –  Search
  • 41. Mixed-Mode Flame Graphs •  Hues: –  green == Java –  red == system –  yellow == C++ •  Intensity randomized to differentiate frames –  Or hashed based on function name Java JVM Kernel Mixed-Mode
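The hue assignment can be sketched as a simple name-based classifier. The rules below are assumptions made for illustration (a `::` suggests a C++ symbol, a `/` suggests a Java package path, anything else is treated as system code); the heuristics actually used by flamegraph.pl may differ. The function name `frame_hue` and the example frame names are hypothetical.

```shell
#!/bin/sh
# frame_hue NAME: guess a hue for one frame name, following the slide's palette.
# Assumed heuristics (illustrative only):
#   contains "::" -> C++     (yellow)
#   contains "/"  -> Java    (green)
#   otherwise     -> system  (red)
frame_hue() {
    case "$1" in
        *::*) echo yellow ;;
        */*)  echo green ;;
        *)    echo red ;;
    esac
}

frame_hue 'JavaThread::run'
frame_hue 'java/lang/Thread:.run'
frame_hue 'start_thread'
```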
  • 42. Differential Flame Graphs •  Hues: –  red == more samples –  blue == less samples •  Intensity shows the degree of difference •  Used for comparing two profiles •  Also used for showing other metrics: e.g., CPI Differential more less
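The comparison behind a differential flame graph amounts to a per-stack delta between two folded profiles. The sketch below assumes two files whose lines are "stack count"; the function name `diff_folded` is made up here, and the FlameGraph repository provides its own difffolded.pl for real use.

```shell
#!/bin/sh
# diff_folded BEFORE AFTER: for every stack seen in either folded file, print
# "stack before_count after_count delta". A positive delta means more samples
# in AFTER (drawn red); a negative delta means fewer (drawn blue).
diff_folded() {
    awk '
        {
            count = $NF
            stack = $0
            sub(/ [0-9]+$/, "", stack)          # strip the trailing count
            if (FILENAME == ARGV[1]) before[stack] += count
            else                     after[stack]  += count
            seen[stack] = 1
        }
        END {
            for (s in seen)
                print s, before[s] + 0, after[s] + 0, after[s] - before[s]
        }
    ' "$1" "$2" | LC_ALL=C sort
}
```

The intensity on the slide corresponds to the size of the delta relative to the rest of the profile.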
  • 43. Flame Graph Search •  Color: magenta to show matched frames search button
  • 44. Flame Charts •  Flame charts: x-axis is time •  Flame graphs: x-axis is population (maximize merging) •  Final note: these are useful, but are not flame graphs
  • 46. System Profiling Java on x86 •  For example, using Linux perf •  The stacks are 1 or 2 levels deep, and have junk values # perf record -F 99 -a -g -- sleep 30 # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) […]
  • 47. … as a Flame Graph Broken Java stacks (missing frame pointer)
  • 48. Why Stacks are Broken •  On x86 (x86_64), hotspot uses the frame pointer register (RBP) as general purpose •  This "compiler optimization" breaks (simple) stack walking •  Once upon a time, x86 had fewer registers, and this made much more sense •  gcc provides -fno-omit-frame-pointer to avoid doing this, but the JVM had no such option…
  • 49. Fixing Stack Walking Possibilities: A.  Fix frame pointer-based stack walking (the default) –  Pros: simple, supported by many tools –  Cons: might cost a little extra CPU B.  Use a custom walker (likely needing kernel support) –  Pros: full stack walking (incl. inlining) & arguments –  Cons: custom kernel code, can cost more CPU when in use C.  Try libunwind and DWARF –  Even feasible with JIT? Our current preference is (A)
  • 50. Hacking OpenJDK (1/2) •  As a proof of concept, I hacked hotspot to support an x86_64 frame pointer --- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 … +++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 … @@ -166,10 +166,9 @@ // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ ) // -// Class for all pointer registers (including RSP) +// Class for all pointer registers (including RSP, excluding RBP) reg_class any_reg(RAX, RAX_H, RDX, RDX_H, - RBP, RBP_H, RDI, RDI_H, RSI, RSI_H, RCX, RCX_H, [...] Remove RBP from register pools
  • 51. Hacking OpenJDK (2/2) •  We used this patched version successfully for some limited (and urgent) performance analysis --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04… +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 … @@ -5236,6 +5236,7 @@ // We always push rbp, so that on return to interpreter rbp, will be // restored correctly and we can correct the stack. push(rbp); + mov(rbp, rsp); // Remove word for ebp framesize -= wordSize; --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … [...] Fix x86_64 function prologues
  • 52. -XX:+PreserveFramePointer •  We shared our patch publicly –  See "A hotspot patch for stack profiling (frame pointer)" on the hotspot compiler dev mailing list –  It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8, and the -XX:+PreserveFramePointer option •  Zoltán Majó (Oracle) took this on, rewrote it, and it is now: –  In JDK 9 –  In JDK 8 update 60 build 19 –  Thanks to Zoltán, Oracle, and the other hotspot engineers for helping get this done! •  It might cost 0 – 3% CPU, depending on workload
  • 53. Broken Java Stacks (before) •  Check with "perf script" to see stack samples •  These are 1 or 2 levels deep (junk values) # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f417908c194 [unknown] (/tmp/perf-4458.map) […]
  • 54. Fixed Java Stacks •  With -XX:+PreserveFramePointer stacks are full, and go all the way to start_thread() •  This is what the CPUs are really running: inlined frames are not present # perf script […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
  • 55. Fixed Stacks Flame Graph Java stacks (but no symbols)
  • 56. Stacks & Inlining •  Frames may be missing (inlined) •  Disabling inlining: –  -XX:-Inline –  Many more Java frames –  Can be 80% slower! •  May not be necessary –  Inlined flame graphs often make enough sense –  Or tune -XX:MaxInlineSize and -XX:InlineSmallCode a little to reveal more frames •  Can even improve performance! •  perf-map-agent (next) has experimental un-inline support No inlining
  • 58. Missing Symbols •  Missing symbols may show up as hex; e.g., Linux perf: (not broken) 12.06% 62 sed sed [.] re_search_internal | --- re_search_internal | |--96.78%-- re_search_stub | rpl_re_search | match_regex | do_subst | execute_program | process_files | main | __libc_start_main (broken) 71.79% 334 sed sed [.] 0x000000000001afc1 | |--11.65%-- 0x40a447 | 0x40659a | 0x408dd8 | 0x408ed1 | 0x402689 | 0x7fa1cd08aec5
  • 59. Fixing Symbols •  For JIT'd code, Linux perf already looks for an externally provided symbol file: /tmp/perf-PID.map, and warns if it doesn't exist •  This file can be created by a Java agent # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […]
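As a rough illustration of how such a symbol file resolves an address, the sketch below looks up a hex address in a map file. It relies on the perf map format of one `START SIZE symbolname` entry per line, with START and SIZE in hex without a `0x` prefix, and assumes a well-formed file with no blank lines. The function name `lookup_symbol` and the symbol names in the example are hypothetical.

```shell
#!/bin/sh
# lookup_symbol MAPFILE ADDR: print the symbol whose [START, START+SIZE) range
# contains ADDR. ADDR, START and SIZE are hex without a 0x prefix.
lookup_symbol() {
    map=$1
    addr=$((0x$2))
    while read -r start size name; do
        lo=$((0x$start))
        hi=$((lo + 0x$size))
        if [ "$addr" -ge "$lo" ] && [ "$addr" -lt "$hi" ]; then
            printf '%s\n' "$name"
            return 0
        fi
    done < "$map"
    return 1                                    # no entry covers ADDR
}

# Example with a made-up two-entry map file.
tmp=$(mktemp)
printf '7fd301861e00 80 example_method_a\n7fd301861e80 40 example_method_b\n' > "$tmp"
lookup_symbol "$tmp" 7fd301861e46
rm -f "$tmp"
```

This also shows why stale symbols matter: if the JVM recompiles or moves a method after the file is written, the ranges no longer describe what was running at sample time.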
  • 60. Java Symbols for perf •  perf-map-agent –  https://github.com/jrudolph/perf-map-agent –  Agent attaches and writes the /tmp file on demand (previous versions attached on Java start, wrote continually) –  Thanks Johannes Rudolph! •  Use of a /tmp symbol file –  Pros: simple, can be low overhead (snapshot on demand) –  Cons: stale symbols •  Using a symbol logger with perf instead –  Patch by Stephane Eranian currently being discussed on lkml; see "perf: add support for profiling jitted code"
  • 61. Java Mixed-Mode Flame Graph Stacks & Symbols Java JVM Kernel GC
  • 64. Instructions 1.  Check Java version 2.  Install perf-map-agent 3.  Set -XX:+PreserveFramePointer 4.  Profile Java 5.  Dump symbols 6.  Generate Mixed-Mode Flame Graph Note these are unsupported: use at your own risk Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
  • 65. 1. Check Java Version •  Need JDK8u60 or better –  for -XX:+PreserveFramePointer •  Upgrade Java if necessary $ java -version java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
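The version requirement can be checked in a script. This is a minimal sketch, assuming the version string has the form shown above (e.g., `1.8.0_60`) or the newer `9`, `11.0.2` style; the function name `supports_preserve_fp` is made up for this example, and it only compares the update number, not the build number.

```shell
#!/bin/sh
# supports_preserve_fp VERSION: succeed if VERSION is JDK 8u60 or later.
supports_preserve_fp() {
    case "$1" in
        1.8.0_*)
            update=${1#1.8.0_}
            update=${update%%[!0-9]*}           # drop suffixes such as -b27
            [ "${update:-0}" -ge 60 ] ;;
        1.*)
            return 1 ;;                         # JDK 7 and earlier, or 8 GA
        *)
            major=${1%%[!0-9]*}                 # 9, 11.0.2, ...
            [ "${major:-0}" -ge 9 ] ;;
    esac
}

# Possible usage (java prints its version on stderr):
#   v=$(java -version 2>&1 | awk -F'"' 'NR == 1 { print $2 }')
#   supports_preserve_fp "$v" || echo "upgrade Java first"
```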
  • 66. 2. Install perf-map-agent •  Check https://github.com/jrudolph/perf-map-agent for the latest instructions. e.g.: $ sudo bash # apt-get install -y cmake # export JAVA_HOME=/usr/lib/jvm/java-8-oracle # cd /usr/lib/jvm # git clone --depth=1 https://github.com/jrudolph/perf-map-agent # cd perf-map-agent # cmake . # make
  • 67. 3. Set -XX:+PreserveFramePointer •  Needs to be set on Java startup •  Check it is enabled (on Linux): $ ps wwp `pgrep -n java` | grep PreserveFramePointer
  • 68. 4. Profile Java •  Using Linux perf_events to profile all processes, at 99 Hertz, for 30 seconds (as root): # perf record -F 99 -a -g -- sleep 30 •  Just profile one PID (broken on some older kernels): # perf record -F 99 -p PID -g -- sleep 30 •  These create a perf.data file
  • 69. 5. Dump Symbols •  See perf-map-agent docs for updated usage •  e.g., as the same user as java: $ cd /usr/lib/jvm/perf-map-agent/out $ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce PID •  perf-map-agent contains helper scripts. I wrote my own: –  https://github.com/brendangregg/Misc/blob/master/java/jmaps •  Dump symbols quickly after perf record to minimize stale symbols. How I do it: # perf record -F 99 -a -g -- sleep 30; jmaps
  • 70. 6. Generate a Mixed-Mode Flame Graph •  Using my FlameGraph software: # perf script > out.stacks01 # git clone --depth=1 https://github.com/brendangregg/FlameGraph # cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg –  perf script reads perf.data with /tmp/*.map –  out.stacks01 is an intermediate file; can be handy to keep •  Finally open flame01.svg in a browser •  Check for newer flame graph implementations (e.g., d3)
  • 73. Netflix Vector Near real-time, per-second metrics Flame Graphs Select Metrics Select Instance
  • 74. Netflix Vector •  Open source, on-demand, instance analysis tool –  https://github.com/netflix/vector •  Shows various real-time metrics •  Flame graph support currently in development –  Automating previous steps –  Using it internally already –  Also developing a new d3 front end
  • 77. Linux perf_events Coverage … all possible with Java stacks
  • 78. Advanced Flame Graphs •  Examples: –  Page faults –  Context switches –  Disk I/O requests –  TCP events –  CPU cache misses –  CPI •  Any event issued in synchronous Java context
  • 79. Synchronous Java Context •  Java thread still on-CPU, and event is directly triggered •  Examples: –  Disk I/O requests issued directly by Java → yes •  direct reads, sync writes, page faults –  Disk I/O completion interrupts → no* –  Disk I/O requests triggered async, e.g., readahead → no* * can be made yes by tracing and associating context
  • 80. Page Faults •  Show what triggered main memory (resident) to grow: # perf record -e page-faults -p PID -g -- sleep 120 •  "fault" as (physical) main memory is allocated on-demand, when a virtual page is first populated (this is why the RES column in top(1) grows) •  Low overhead tool to solve some types of memory leak
  • 81. Page Fault Flame Graph GC Java code epoll
  • 82. Context Switches •  Show why Java blocked and stopped running on-CPU: # perf record -e context-switches -p PID -g -- sleep 5 •  Identifies locks, I/O, sleeps –  If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load). •  e.g., was used to understand framework differences: rxNetty vs Tomcat
  • 83. Context Switch Flame Graph (1/2) rxNetty epoll futex
  • 84. Context Switch Flame Graph (2/2) Tomcat sys_poll futex
  • 85. Disk I/O Requests •  Shows who issued disk I/O (sync reads & writes): # perf record -e block:block_rq_insert -a -g -- sleep 60 •  e.g.: page faults in GC? This JVM has swapped out! [flame graph label: GC]
  • 86. TCP Events •  TCP transmit, using dynamic tracing: # perf probe tcp_sendmsg # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks # perf probe --del tcp_sendmsg •  Note: can be high overhead for high packet rates –  For the current perf trace, dump, post-process cycle •  Can also trace TCP connect & accept (lower overhead) •  TCP receive is async –  Could trace via socket read
  • 87. TCP Send Flame Graph kernel Java JVM Only one code-path taken in this example ab (client process)
  • 88. CPU Cache Misses •  In this example, sampling via Last Level Cache loads: # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks •  -c is the count (samples once per count) •  Use other CPU counters to sample hits, misses, stalls
  • 89. One Last Example •  Back to a mixed-mode CPU flame graph •  What else can we show with color?
  • 90. CPI Flame Graph •  Cycles Per Instruction! –  red == instruction heavy –  blue == cycle heavy (likely mem stall cycles) zoomed:
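The metric itself is a simple ratio: cycles divided by instructions. A minimal sketch of the arithmetic follows; the function name `cpi` and the counts are made up, and in practice the two counts could come from hardware counters such as the `cycles` and `instructions` events.

```shell
#!/bin/sh
# cpi CYCLES INSTRUCTIONS: print cycles per instruction to two decimal places.
# Prints nothing if INSTRUCTIONS is zero.
cpi() {
    awk -v c="$1" -v i="$2" 'BEGIN { if (i > 0) printf "%.2f\n", c / i }'
}

# Hypothetical counts for one code path.
cpi 3000000 1500000
```

A higher ratio means more cycles spent per retired instruction, which on the slide's color scale is the cycle-heavy (blue) end.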
  • 91. Links & References •  Flame Graphs –  http://www.brendangregg.com/flamegraphs.html –  http://techblog.netflix.com/2015/07/java-in-flames.html –  http://techblog.netflix.com/2014/11/nodejs-in-flames.html –  http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html •  Linux perf_events –  https://perf.wiki.kernel.org/index.php/Main_Page –  http://www.brendangregg.com/perf.html –  http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html •  Netflix Vector –  https://github.com/netflix/vector –  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html •  JDK tickets –  JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465 –  JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945 •  hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
  • 92. Thanks •  Questions? •  http://techblog.netflix.com •  http://slideshare.net/brendangregg •  http://www.brendangregg.com •  bgregg@netflix.com •  @brendangregg Oct 2015