HADOOP CLUSTER
PERFORMANCE
PROFILING
Ihor Bobak
Lead Software Engineer, EPAM Systems
AUGUST 27, 2015
CONTENTS
Covered topics:
• What is profiling? How do profilers work?
• What problems can affect performance?
• How to profile a distributed application?
• Gathering, storing and analysis of stack traces
• Memory analysis
• Use Case
• Alternative approaches to profiling
WHAT IS A PROFILER?
A profiler is a tool for finding which parts of your app work slowly.
VisualVM YourKit
HOW DO PROFILERS WORK?
• Instrumenting: adding extra bytecode to your
methods for recording when they’re called and
how long they execute.
• Sampling: taking dumps of all the threads
periodically in order to understand how much
CPU time each method takes.
DIFFICULTIES WITH A CLUSTER
This is a typical mapreduce application running on a Hadoop cluster.
All blue boxes are separate JVM processes running on different
machines. Question: how can we profile a distributed Java app?
1. How to attach to a process
running on another host?
2. How to track the appearance
of new processes?
3. How to gather profiling data?
4. How to analyze this vast
amount of data?
WHY DO WE NEED A CLUSTER PROFILER?
Answer: we need a profiler to get more performance.
The Hadoop principle is:
“If you want more performance, add more hardware”.
This is true. But it is not the only truth.
Another truth is: there are problems common to
ALL applications (both distributed and local).
PROBLEM 1: SUBOPTIMAL CODE
public static void QuickSort(int[] a, int x, int y) {
    int pivot = (x + y) / 2;
    int apivot = a[pivot];
    int i = x;
    int j = y;
    while (i <= j) {
        while (a[i] < apivot) i++;
        while (a[j] > apivot) j--;
        if (i <= j) {
            int temp = a[i];
            a[i] = a[j];
            a[j] = temp;
            i++;
            j--;
        }
    }
    if (x < j)
        QuickSort(a, x, j);
    if (i < y)
        QuickSort(a, i, y);
}

public static void StupidSort(int[] a) {
    for (int i = 0; i < a.length - 1; ++i)
        for (int j = i + 1; j < a.length; ++j)
            if (a[i] > a[j]) {
                int temp = a[i];
                a[i] = a[j];
                a[j] = temp;
            }
}
Brute-force (“tupo-v-lob”) sort: O(N^2); Quicksort: O(N*log(N))
This is a simple example of two different algorithms solving the same task.
PROBLEM 2: BAD CODE/DATA
• Repeatedly doing the same unnecessary actions
Example: re-reading the configuration file or a database table again and again during every operation (although we
could cache it in the memory).
• Wrong usage of someone’s code/libraries/binaries
Example: Sqoop can import from MySQL in two modes – direct mode (using mysqldump and mysqlimport) and JDBC
mode. The first one is faster.
• Usage of wrong libraries
Example: https://powercollections.codeplex.com/workitem/16950
I found that famous Wintellect’s OrderedSet works 3 times slower than native Microsoft’s SortedSet.
• Absence of indexes in a database
Example: “select * from fact join dim on fact.productid = dim.productid” is slow because the developers forgot to
create keys/indexes
• Bugs in famous libraries/frameworks
Example: http://ihorbobak.com/index.php/2015/06/03/spark-sql-bad-performance/ problem with A->B->C tables
join when enumerated in order A, C, B. This is handled fine by all database servers, but NOT by Spark SQL.
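The fix for the first bullet, caching a configuration in memory instead of re-reading it on every operation, can be sketched as follows (the class and key names here are hypothetical, just for illustration):

```java
import java.util.Properties;

// Hypothetical example: load the configuration once and reuse it,
// instead of re-reading the file on every operation.
public class CachedConfig {
    private static volatile Properties cached;

    public static Properties get() {
        Properties local = cached;
        if (local == null) {
            synchronized (CachedConfig.class) {
                local = cached;
                if (local == null) {          // double-checked locking
                    local = loadFromDisk();   // expensive: done only once
                    cached = local;
                }
            }
        }
        return local;
    }

    private static Properties loadFromDisk() {
        Properties p = new Properties();
        p.setProperty("example.key", "example.value"); // stand-in for real file I/O
        return p;
    }
}
```

Every call after the first returns the same in-memory instance, so the expensive read happens exactly once per JVM.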
PROBLEM 3: HARDWARE TROUBLES
The two most important problems are:
• Disk problems (slow I/O speed)
• Network problems (slow bandwidth, packet
loss)
CLUSTER PROFILER ARCHITECTURE
Pipeline (shown as a diagram on the slide):
Java process → “injected” code which does stacktrace sampling → stacktraces
passed every 10 seconds through HTTP → a set of Python/Perl scripts that
produce visualizations → visualization in the form of flame graphs.
This is applicable to any Java process: mapper, reducer, etc., and it is
applicable not only to Hadoop: it can be Spark RDD code, Java web app
code, etc.
HOW THE JAVA AGENT WORKS
• The agent is bound to a Java process by specifying the -javaagent parameter, e.g.
java -javaagent:/path/agent.jar=parameters MainClass
or by overriding _JAVA_OPTIONS like this:
_JAVA_OPTIONS='-javaagent:/path/agent.jar=parameters'
• Agent’s jar has a manifest with
PreMain-Class: namespace.TheAgentClass
• “TheAgentClass” has a premain() method that executes before your
main() and does the following:
– Reads the parameters of the agent
– Constructs the profiler instances (based on parameters)
– Creates a ScheduledExecutorService (see java.util.concurrent) that does
scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS)
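A minimal skeleton of such an agent (the class, package, and parameter names here are illustrative, not the actual statsd-jvm-profiler source) could look like:

```java
import java.lang.instrument.Instrumentation;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TheAgentClass {
    // Parses "key1=v1,key2=v2" agent arguments into a map.
    public static Map<String, String> parseArgs(String args) {
        Map<String, String> params = new HashMap<>();
        if (args != null && !args.isEmpty()) {
            for (String pair : args.split(",")) {
                String[] kv = pair.split("=", 2);
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        return params;
    }

    // Runs before main() because the jar manifest declares
    // PreMain-Class: namespace.TheAgentClass
    public static void premain(String args, Instrumentation inst) {
        Map<String, String> params = parseArgs(args); // 1. read the parameters

        Runnable worker = () -> {
            // 2. the profiler constructed from params would sample and report here
        };

        // 3. report every 10 seconds; a daemon thread never keeps the target JVM alive
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "profiler-reporter");
                t.setDaemon(true);
                return t;
            });
        scheduler.scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS);
    }
}
```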
HOW THE JAVA AGENT WORKS
The profiler thread collects stacktraces 100 times per second using ThreadMXBean (a
part of JMX – a technology for monitoring and managing the JVM)
public void profile() {
    profileCount++;
    try {
        for (ThreadInfo thread : getAllRunnableThreads()) {
            if (thread.getStackTrace().length > 0) {
                String traceKey = StackTraceFormatter.formatStackTrace(thread.getStackTrace());
                if (filter.includeStackTrace(traceKey))
                    traces.increment(traceKey, 1);
            }
        }
    }
    catch (OutOfMemoryError ex) {
        // ... skipping code for handling OOM (just for safety)
    }
    if (profileCount == reportingFrequency) {
        profileCount = 0;
        recordMethodCounts();
    }
}
For more information about JMX read here:
https://docs.oracle.com/javase/tutorial/jmx/index.html
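The getAllRunnableThreads() helper is not shown in the snippet; a plausible sketch using ThreadMXBean (the real profiler's implementation may differ) is:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadSampler {
    private static final ThreadMXBean THREADS =
        ManagementFactory.getThreadMXBean();

    // Dump every live thread (with full stack depth) and keep only the
    // RUNNABLE ones -- blocked and waiting threads consume no CPU.
    public static List<ThreadInfo> getAllRunnableThreads() {
        List<ThreadInfo> result = new ArrayList<>();
        for (ThreadInfo info : THREADS.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                result.add(info);
            }
        }
        return result;
    }
}
```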
STATSD + MY CHANGES
I modified the well-known StatsD JVM profiler https://github.com/etsy/statsd-
jvm-profiler
List of my changes:
• Added the jvmName and host tags to each stacktrace;
• Optimized performance of the stacktrace collection code;
• Improved stability – added catching of OutOfMemoryError;
• Added statistics showing how many lines and characters we pass to the backend;
• Seriously modified influxdb_dump.py: now it extracts data into a set of distinct
files – one file for each JVM, each host, and a total;
• Added extraction of memory information and rendering it as charts in R;
• Added call_tree.py: a script for analysis of the method call trees;
• Added some helper scripts.
INFLUXDB
What is InfluxDB?
It is a time series, metrics, and analytics database.
Targeted at:
gathering metrics (like response times, CPU load), sensor
data, events (like exceptions) and real-time analytics.
Key Features:
• SQL-like query language;
• HTTP(S) API for data ingestion and queries;
• Built-in support for other data protocols such as
collectd;
• Has a CLI and web interface;
• Tag data for fast and efficient queries.
Annotations on an InfluxDB screenshot:
• Measurements (analog of tables)
• Tag keys:values
• Timestamps
• SQL-like query language
• Series: measurement name + tag key-values + data values
Schema exploration examples:
• SHOW MEASUREMENTS
shows the list of measurements
• SHOW SERIES FROM /.*cpu.*/
shows the list of series for each measurement whose name matches the
pattern /.*cpu.*/
• SHOW TAG KEYS FROM /.*heap.*/
shows the tag keys of measurements that match the pattern
• SHOW TAG VALUES FROM /.*cpu.*/ WITH KEY = jvmName
shows the values of the jvmName tag for measurements that match the pattern
Data exploration examples:
• SELECT * FROM cpu WHERE host = 'A'
selects series for the "cpu" measurement with tag host='A'
• SELECT percentile(value, 95) FROM response_times
WHERE time > now() - 1d
GROUP BY time(1m)
shows the 95th percentile of response times in the last day in 1 minute
interval
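Ingestion goes through the same HTTP API using InfluxDB's line protocol (measurement,tags fields timestamp). A sketch of building such a point for the "cpu" measurement with the host and jvmName tags from the profiler (the LineProtocol helper class is hypothetical, not a library API); the resulting line would be POSTed to /write?db=<database>:

```java
// Build an InfluxDB line-protocol point: measurement,tags fields timestamp.
// The "i" suffix marks an integer field; the timestamp is in nanoseconds.
public class LineProtocol {
    public static String point(String measurement, String host,
                               String jvmName, long value, long nanos) {
        return measurement + ",host=" + host + ",jvmName=" + jvmName
             + " value=" + value + "i " + nanos;
    }
}
```

For example, point("cpu", "A", "mapper", 42, 1L) produces "cpu,host=A,jvmName=mapper value=42i 1".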
FLAME GRAPHS
Gathered stack traces:
A->B->C
A->B->C->D
A->B->C->D
A->B

Each sample is drawn as one column of frames (root frame A at the bottom),
samples taken at 0, 10, 20 and 30 ms:

  D D
C C C
B B B B
A A A A

THE WIDTH OF A BAR MATTERS.
Color doesn’t matter and is selected just to distinguish bars.
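Before rendering, flamegraph.pl consumes stacks in a "folded" format: one semicolon-joined root-first path per line, followed by a sample count. A sketch of that collapsing step (the FoldedStacks class name is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Collapse sampled stack traces into the folded format that
// flamegraph.pl consumes: "frame1;frame2;frame3 count" per line.
public class FoldedStacks {
    public static Map<String, Integer> fold(List<List<String>> traces) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> trace : traces) {
            String key = String.join(";", trace);  // root-first path
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

For the four traces on this slide, fold() yields A;B;C → 1, A;B;C;D → 2, A;B → 1; bar widths in the rendered flame graph are proportional to those counts.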
FLAME GRAPHS
Flame graphs are a visualization of profiled software, allowing the
most frequent code-paths to be identified quickly and accurately.
Invented by Brendan Gregg: http://www.brendangregg.com
SEQUENCE OF ACTIONS
Steps to Profile a Cluster:
1. Install InfluxDB on a separate machine visible to all machines of the cluster.
Create a database and a user.
2. Get the agent’s jar file from my blog (or from sources) and put it into
/var/lib at every worker node.
3. Change the configuration of the cluster: make _JAVA_OPTIONS='-javaagent…'
available to all JVM processes.
4. Run your application and get the stacktraces into InfluxDB. You may
“switch off” _JAVA_OPTIONS after this.
5. Get the SVG files (flame graphs) from InfluxDB with the help of
influxdb_dump.py and flamegraph_files.sh and do the analysis.
These steps are described in detail at my blog http://ihorbobak.com
LOCATION FOR _JAVA_OPTIONS
_JAVA_OPTIONS='-javaagent:/var/lib/statsd-jvm-profiler-0.8.3-SNAPSHOT.jar=server=serveraddress,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=value1.value2.valueN,tagMapping=tag1,tag2,tagN'
USE CASE WITH A REAL CUSTOMER
The App/Inventory/Environment:
• Our customer has an app that crawls data from a set of sites, parses it
and puts it into a Hadoop cluster (20 machines with 8 cores, 32GB RAM
and 1TB HDD each).
• The app leverages Apache Nutch, Cloudera Hadoop distribution
version 5.3, HBase, MongoDB and other technologies.
• There is a central Java web app (Java/Tomcat) that uses Nutch, which
runs the mapreduce jobs.
The problem:
• The cluster crawls just 100 sites per day; the customer is asking us
“how to make it crawl 10 times more on the same hardware?”
FIRST FINDINGS
The first question that arose in my head: what exactly works slowly?
I quickly found this: the slow parts are the ones that are I/O intensive.
DISK I/O
Then I ran I/O monitoring and a series of disk-speed tests on the nodes.
This is the result of the IOPS benchmark
https://github.com/cxcv/iops/blob/master/iops:

Cluster node:
512 B blocks: 80.9 IO/s, 40.4 KiB/s (331.3 kbit/s)
1 KiB blocks: 97.9 IO/s, 97.9 KiB/s (802.1 kbit/s)
2 KiB blocks: 83.8 IO/s, 167.5 KiB/s ( 1.4 Mbit/s)
4 KiB blocks: 72.3 IO/s, 289.2 KiB/s ( 2.4 Mbit/s)
8 KiB blocks: 69.8 IO/s, 558.7 KiB/s ( 4.6 Mbit/s)
16 KiB blocks: 69.4 IO/s, 1.1 MiB/s ( 9.1 Mbit/s)
32 KiB blocks: 58.2 IO/s, 1.8 MiB/s ( 15.3 Mbit/s)
64 KiB blocks: 54.3 IO/s, 3.4 MiB/s ( 28.5 Mbit/s)
128 KiB blocks: 45.9 IO/s, 5.7 MiB/s ( 48.1 Mbit/s)
256 KiB blocks: 38.7 IO/s, 9.7 MiB/s ( 81.1 Mbit/s)
512 KiB blocks: 29.0 IO/s, 14.5 MiB/s (121.8 Mbit/s)
1 MiB blocks: 18.3 IO/s, 18.3 MiB/s (153.2 Mbit/s)
2 MiB blocks: 10.3 IO/s, 20.7 MiB/s (173.6 Mbit/s)
4 MiB blocks: 5.7 IO/s, 22.8 MiB/s (191.7 Mbit/s)
8 MiB blocks: 4.8 IO/s, 38.8 MiB/s (325.2 Mbit/s)
16 MiB blocks: 2.0 IO/s, 32.6 MiB/s (273.8 Mbit/s)
32 MiB blocks: 0.8 IO/s, 27.0 MiB/s (226.1 Mbit/s)

My local VM:
512 B blocks: 861.1 IO/s, 430.5 KiB/s ( 3.5 Mbit/s)
1 KiB blocks: 1084.7 IO/s, 1.1 MiB/s ( 8.9 Mbit/s)
2 KiB blocks: 836.6 IO/s, 1.6 MiB/s ( 13.7 Mbit/s)
4 KiB blocks: 698.4 IO/s, 2.7 MiB/s ( 22.9 Mbit/s)
8 KiB blocks: 755.7 IO/s, 5.9 MiB/s ( 49.5 Mbit/s)
16 KiB blocks: 909.1 IO/s, 14.2 MiB/s (119.2 Mbit/s)
32 KiB blocks: 784.9 IO/s, 24.5 MiB/s (205.7 Mbit/s)
64 KiB blocks: 747.9 IO/s, 46.7 MiB/s (392.1 Mbit/s)
128 KiB blocks: 593.2 IO/s, 74.2 MiB/s (622.0 Mbit/s)
256 KiB blocks: 441.4 IO/s, 110.4 MiB/s (925.8 Mbit/s)
512 KiB blocks: 423.3 IO/s, 211.6 MiB/s ( 1.8 Gbit/s)
1 MiB blocks: 295.1 IO/s, 295.1 MiB/s ( 2.5 Gbit/s)
2 MiB blocks: 159.1 IO/s, 318.3 MiB/s ( 2.7 Gbit/s)
4 MiB blocks: 103.2 IO/s, 412.6 MiB/s ( 3.5 Gbit/s)
8 MiB blocks: 46.6 IO/s, 372.8 MiB/s ( 3.1 Gbit/s)
16 MiB blocks: 23.4 IO/s, 374.0 MiB/s ( 3.1 Gbit/s)
32 MiB blocks: 11.9 IO/s, 381.9 MiB/s ( 3.2 Gbit/s)

The cluster node is about 10 times slower than a VM running on my development
workstation (host: Core i7/32GB/1TB; guest: 3-core VM with 16GB RAM).
FETCHER MAPREDUCE JOB
% of CPU time:
• 15% – HTML parsing
• 15% – Hadoop framework initialization code
• 7% – HDFS initialization code
• 22% – reducer code (BAD NEWS HERE)
• 18% – reading Hadoop XML config files
• 23% – real job
DRILL DOWN INTO THE REDUCER
Annotations on the FetcherReducer.run() flame graph:
• org.apache.hadoop.hbase.catalog.MetaReader.fullScan()
• Parsing the avro schema: org.apache.avro.Schema$Parser.parse(),
ending with ZipFile.read(), ZipFile.getEntry(), etc.
• org.apache.hadoop.hbase.client.HConnectionManager.createConnection()
• Creating a record writer
DRILL DOWN INTO THE RECORD WRITER
This is Gora library code. The most observable function calls on top are:
• java.util.zip.*
• FileInputStream*
• FileOutputStream*
REPEATING SLOW PARTS IN ALL JOBS
INEFFICIENT MEMORY MANAGEMENT
Most Java processes used significantly less memory
than they were initially assigned.
Legend:
• init – the initial amount of memory that the JVM requests from the OS
during startup;
• used – the amount of memory currently used;
• committed – the amount of memory that is guaranteed to be available
for use by the Java virtual machine;
• max – the maximum amount of memory (in bytes) that can be used for
memory management.
A memory allocation may fail if it attempts to increase the used memory
such that used > committed, even if used <= max would still be true.
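The same init/used/committed/max numbers can be read from inside any JVM through JMX; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Read the same init/used/committed/max numbers the profiler records,
// for the current JVM's heap.
public class HeapStats {
    public static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage u = heap();
        System.out.println("init=" + u.getInit() + " used=" + u.getUsed()
            + " committed=" + u.getCommitted() + " max=" + u.getMax());
    }
}
```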
PROBLEMS AND NEXT STEPS
1) Gora + HBase
Reason: bad code in Gora (too many full table scans of metadata)
Actions:
• Check Gora’s configuration; dive into the code to find out why it does a full scan
• Try Cassandra instead of HBase
2) Hadoop Framework parts, in particular:
• HDFS initialization in mapreduce jobs (slow communication with the Namenode)
• Reading configuration files (done with the Xerces library).
Possible reasons:
• Bad I/O speed and bad network speed.
• There may be a way to parameterize the XML parsing of config files that we’re not aware of.
Actions:
• Fix the hardware issues.
• Find out why Hadoop XML config parsing may be so slow.
• Check namenode memory usage.
ANOTHER METHOD OF GETTING STACK TRACES
Another method to get stack traces is Linux’s perf_events:
perf record -F 99 -g -p PID
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
Perf monitors:
• Hardware events (e.g. level 2 cache
misses);
• Software events (e.g. CPU migrations)
• Tracepoint events (e.g. filesystem I/O,
TCP events)
Perf can also do
• Sampling: collection of snapshots at some
frequency (by timer)
• Dynamic tracing: instrumenting code to
create events in any location (using
kprobes or uprobes frameworks)
For more details see: http://www.brendangregg.com/perf.html
PERF vs. JAVA AGENT
Advantages of perf over the Java agent:
• Low overhead when getting stack traces;
• Combines user calls (Java) and kernel calls in one flame graph;
• Catches 100% of Java methods (even hot methods from which the JVM
may exclude safepoint checks)
(http://chriskirk.blogspot.com/2013/09/what-is-java-safepoint.html – a good
explanation of safepoints).
Disadvantages of perf:
• Cannot get Java stacktraces out of the box (frame pointer-based stack
walking had to be fixed in OpenJDK – done by Netflix and Twitter);
• Doesn’t see Java symbols (hex numbers instead; a special agent is needed to add
symbols: https://github.com/jrudolph/perf-map-agent );
• Permissions must be configured for the symbol files;
• It is necessary to develop a service which will launch perf, get the stacktraces and
pass them to a server.
PERF vs. JAVA AGENT
And… it turns out that Netflix’s product is open sourced…
CREDITS
Andrew Johnson
Software Engineer at Etsy
Previously: Explorys, Inc.
https://www.linkedin.com/in/ajsquared
Brendan Gregg
Senior Performance Architect at Netflix
Previously: Joyent, Oracle, Sun Microsystems
http://www.brendangregg.com/index.html
BLOGS/ARTICLES
Blogs:
• My blog article
http://ihorbobak.com/index.php/2015/08/05/cluster-profiling/
• Etsy’s blog about JVM Profiler
https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/
https://codeascraft.com/2015/05/12/four-months-of-statsd-jvm-profiler-a-retrospective/
• Brendan Gregg’s blog
http://www.brendangregg.com/blog/index.html
Source code:
• My modification of StatsD JVM Profiler
https://github.com/ibobak/statsd-jvm-profiler
• Original Etsy’s StatsD JVM Profiler
https://github.com/etsy/statsd-jvm-profiler
• Brendan Gregg’s FlameGraph
https://github.com/brendangregg/FlameGraph
Manuals:
• InfluxDB Docs
https://influxdb.com/docs/v0.9/introduction/overview.html
• Overview of the JMX Technology
https://docs.oracle.com/javase/tutorial/jmx/overview/index.html
• JVM Tool Interface
http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#starting
BOOKS / VIDEOS
• Systems Performance: Enterprise and the Cloud
by Brendan Gregg
http://www.amazon.com/Systems-Performance-
Enterprise-Brendan-Gregg/dp/0133390098
• Blazing Performance with Flame Graphs
by Brendan Gregg
https://www.youtube.com/watch?v=nZfNehCzGdw
• Linux profiling at Netflix
by Brendan Gregg
https://www.youtube.com/watch?v=_Ik8oiQvWgo
• Profiling Java in Production
by Kaushik Srenevasan, Twitter University
https://www.youtube.com/watch?v=Yg6_ulhwLw0
Contacts:
Ihor Bobak
E-mail: ibobak@gmail.com
Skype: ibobak
More Related Content

PDF
Core Java Programming Language (JSE) : Chapter X - I/O Fundamentals
PDF
Java Performance Tuning
PDF
JCConf 2020 - New Java Features Released in 2020
PPTX
Application Profiling for Memory and Performance
PDF
Why GC is eating all my CPU?
ODP
Java 5 6 Generics, Concurrency, Garbage Collection, Tuning
ODP
Java Garbage Collection, Monitoring, and Tuning
PPTX
Java memory model
Core Java Programming Language (JSE) : Chapter X - I/O Fundamentals
Java Performance Tuning
JCConf 2020 - New Java Features Released in 2020
Application Profiling for Memory and Performance
Why GC is eating all my CPU?
Java 5 6 Generics, Concurrency, Garbage Collection, Tuning
Java Garbage Collection, Monitoring, and Tuning
Java memory model

What's hot (20)

DOCX
Advance Java Programs skeleton
ODP
Java Concurrency, Memory Model, and Trends
PPTX
Java SE 8 - New Features
PPT
CS6270 Virtual Machines - Java Virtual Machine Architecture and APIs
PPTX
Java 8 Feature Preview
PPTX
Concurrency with java
PPTX
Pune-Cocoa: Blocks and GCD
PDF
Java lab-manual
PDF
Java Concurrency in Practice
PDF
New Features Of JDK 7
PDF
ITFT - Java Coding
PPT
Java Tut1
PDF
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
PPTX
Java Micro-Benchmarking
PPT
Java programing considering performance
PDF
02 basic java programming and operators
PDF
Java features. Java 8, 9, 10, 11
PPTX
The Java memory model made easy
PDF
Java 7 New Features
PDF
Identifying memory leaks in Android applications
Advance Java Programs skeleton
Java Concurrency, Memory Model, and Trends
Java SE 8 - New Features
CS6270 Virtual Machines - Java Virtual Machine Architecture and APIs
Java 8 Feature Preview
Concurrency with java
Pune-Cocoa: Blocks and GCD
Java lab-manual
Java Concurrency in Practice
New Features Of JDK 7
ITFT - Java Coding
Java Tut1
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Java Micro-Benchmarking
Java programing considering performance
02 basic java programming and operators
Java features. Java 8, 9, 10, 11
The Java memory model made easy
Java 7 New Features
Identifying memory leaks in Android applications
Ad

Similar to Hadoop cluster performance profiler (20)

PPTX
Profile hadoop apps
PDF
Next-Gen Business Transaction Configuration, Instrumentation, and Java Perfor...
PDF
Rigorous and Multi-tenant HBase Performance
PPTX
Rigorous and Multi-tenant HBase Performance Measurement
PDF
The Art Of Performance Tuning - with presenter notes!
PPTX
Performance eng prakash.sahu
PPTX
DIY Java Profiling
PDF
Shared slides-edbt-keynote-03-19-13
PDF
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
PDF
Java black box profiling JUG.EKB 2016
PDF
Performance Profiling Tools & Tricks
PDF
Performance Profiling Tools and Tricks
PPTX
Apache big-data-2017-spark-profiling
PDF
Tools and Tips to Diagnose Performance Issues
PDF
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
PDF
20140228 fp and_performance
PDF
Introduction to Java Profiling
PDF
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
PDF
Java Performance & Profiling
PDF
Odp - On demand profiler (ICPE 2018)
Profile hadoop apps
Next-Gen Business Transaction Configuration, Instrumentation, and Java Perfor...
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance Measurement
The Art Of Performance Tuning - with presenter notes!
Performance eng prakash.sahu
DIY Java Profiling
Shared slides-edbt-keynote-03-19-13
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Java black box profiling JUG.EKB 2016
Performance Profiling Tools & Tricks
Performance Profiling Tools and Tricks
Apache big-data-2017-spark-profiling
Tools and Tips to Diagnose Performance Issues
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
20140228 fp and_performance
Introduction to Java Profiling
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Java Performance & Profiling
Odp - On demand profiler (ICPE 2018)
Ad

Recently uploaded (20)

PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
MYSQL Presentation for SQL database connectivity
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Machine learning based COVID-19 study performance prediction
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Encapsulation theory and applications.pdf
PPTX
Cloud computing and distributed systems.
PDF
cuic standard and advanced reporting.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Reach Out and Touch Someone: Haptics and Empathic Computing
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Unlocking AI with Model Context Protocol (MCP)
Understanding_Digital_Forensics_Presentation.pptx
Encapsulation_ Review paper, used for researhc scholars
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Network Security Unit 5.pdf for BCA BBA.
Programs and apps: productivity, graphics, security and other tools
MYSQL Presentation for SQL database connectivity
The AUB Centre for AI in Media Proposal.docx
Machine learning based COVID-19 study performance prediction
Spectral efficient network and resource selection model in 5G networks
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Encapsulation theory and applications.pdf
Cloud computing and distributed systems.
cuic standard and advanced reporting.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
NewMind AI Weekly Chronicles - August'25 Week I
Build a system with the filesystem maintained by OSTree @ COSCUP 2025

Hadoop cluster performance profiler

  • 1. HADOOP CLUSTER PERFORMANCE PROFILING Ihor Bobak Lead Software Engineer, EPAM Systems AUGUST 27, 2015
  • 2. CONTENTS Covered topics: • What is profiling? How do profilers work? • What problems can affect performance? • How to profile a distributed application? • Gathering, storing and analysis of stack traces • Memory analysis • Use Case • Alternative approaches to profiling
  • 3. 3 WHAT IS A PROFILER? Profiler is a tool to look what parts of your app is working slowly. VisualVM YourKit
  • 4. 4 HOW DO PROFILERS WORK? • Instrumenting: adding extra bytecode to your methods for recording when they’re called and how long they execute. • Sampling: taking dumps of all the threads periodically in order to understand how much CPU time each method takes.
  • 5. 5 DIFFUCULTIES WITH A CLUSTER This is a typical mapreduce application running on a Hadoop cluster. All blue boxes are separate JVM processes running on different machines. Question: how can we profile a distributed Java app? 1. How to attach to a process running on another host? 2. How to track the appearance of new processes? 3. How to gather profiling data? 4. How to analyze this vast amount of data?
  • 6. 6 WHY DO WE NEED A CLUSTER PROFILER? Answer: we need a profiler to get more performance. Hadoop principle is next: “If you want more performance, add more hardware”. This is a truth. But this is not the only truth. Another truth is: there are problems that are related to ALL applications (both distributed and local).
  • 7. 7 PROBLEM 1: NOT OPTIMAL CODE public static void QuickSort(int[] a, int x, int y){ int pivot = (x+y)/2; int apivot = a[pivot]; int i = x; int j = y; while (i <= j){ while (a[i] < apivot) i++; while (a[j] > apivot) j--; if (i <= j){ int temp = a[i]; a[i] = a[j]; a[j] = temp; i++; j--; } } if (x < j) QuickSort(a, x, j); if (i < y) QuickSort(a, i, y); } public static void StupidSort(int[] a){ for (int i = 0; i < a.length - 1; ++i) for (int j = i + 1; j < a.length; ++j) if (a[i] > a[j]){ int temp = a[i]; a[i] = a[j]; a[j] = temp; } } “Tupo-v-lob” sort: O(N^2) Quicksort: O(N*log(N)) This is a simple example of different algorithm solving the same tasks:
  • 8. 8 PROBLEM 2: BAD CODE/DATA • Repeatedly doing the same unnecessary actions Example: re-reading the configuration file or a database table again and again during every operation (although we could cache it in the memory). • Wrong usage of someone’s code/libraries/binaries Example: sqoop can import from MySQL in two modes – direct mode (using mysqldump and mysqlimport) and JDBC- mode. The first one is faster. • Usage of wrong libraries Example: https://powercollections.codeplex.com/workitem/16950 I found that famous Wintellect’s OrderedSet works 3 times slower than native Microsoft’s SortedSet. • Absense of indexes in a database Example: “select * from fact join dim on fact.productid = dim.productid” is slow because developers missed to make keys/indexes • Bugs in famous libraries/frameworks Example: http://ihorbobak.com/index.php/2015/06/03/spark-sql-bad-performance/ problem with A->B->C tables join when enumerated in order A, C, B. This is handled fine by all database servers, but NOT by Spark SQL.
  • 9. 9 PROBLEM 3: HARDWARE TROUBLES Two most important problems are: • Disk problems (slow I/O speed) • Network problems (slow bandwidths, packets loss)
  • 10. 10 CLUSTER PROFILER ARCHITECTURE Java Process “Injected” code which does stacktrace sampling Passing stacktraces each 10 seconds though HTTP A set of Python/Perl scripts to get visualizations Visualization in the form of flame graphs This is applicable to any java process: mapper, reducer, etc., and it is applicable not only Hadoop: it can be Spark RDD code, Java web app code, etc.
  • 11. 11 HOW JAVA AGENT WORKS? • Agent is bound to a java process by specifying -javaagent parameter, e.g. java –javaagent:/path/agent.jar=parameters MainClass or by overriding _JAVA_OPTIONS like this: _JAVA_OPTIONS='-javaagent:/path/agent.jar=parameters • Agent’s jar has a manifest with PreMain-Class: namespace.TheAgentClass • “TheAgentClass” has a premain() method that executes before your main() and does the following: – Read the parameters of the agent – Constructs the profiler instances (based on parameters) – Creates a ScheduledExecutorService (see java.util.concurrent) that does scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS)
  • 12. 12 HOW JAVA AGENT WORKS? The profiler thread collects stacktraces 100 times per second using ThreadMXBean (a part of JMX – a technology for monitoring and managing the JVM) public void profile() { profileCount++; try{ for (ThreadInfo thread : getAllRunnableThreads()) { if (thread.getStackTrace().length > 0) { String traceKey = StackTraceFormatter.formatStackTrace(thread.getStackTrace()); if (filter.includeStackTrace(traceKey)) traces.increment(traceKey, 1); } } } catch (OutOfMemoryError ex) { // ... skipping code for handling OOM (just for safety) } if (profileCount == reportingFrequency) { profileCount = 0; recordMethodCounts(); } } For more information about JMX read here: https://docs.oracle.com/javase/tutorial/jmx/index.html
  • 13. 13 STATSD + MY CHANGES I made a modification of a famous StatsD JVM profiler https://github.com/etsy/statsd- jvm-profiler List of my changes: • Added the jvmName and host tag to each stacktrace; • Optimized performance in stacktraces collection code; • Improved stability - added catching of OutOfMemoryException; • Added statistics to show how many lines and characters we pass to the backend; • Seriously modified the influxdb_dump.py: now it extracts data into a set of distinct files - one file for each JVM, each host and a total. • Added extraction of memory information and rendering it with charts in R • Added call_tree.py: a script for analysis of the method call trees • Added some helper scripts.
  • 14. 14 INFLUXD What is InfluxDB? It is a time series, metrics, and analytics database. Targeted at: gathering metrics (like response times, CPU load), sensor data, events (like exceptions) and real-time analytics. Key Features: • SQL-like query language; • HTTP(S) API for data ingestion and queries; • Built-in support for other data protocols such as collectd; • Has a CLI and web interface; • Tag data for fast and efficient queries.
  • 15. 15 Measurements (analog of tables) tag keys:values SQL-like query language timestamps Series: measurement name + tag key-values + data values
  • 16. 16 Schema exploration examples: • SHOW MEASUREMENTS shows the list of measurements • SHOW SERIES FROM /.*cpu.*/ shows the list of series for each measurement whose name matches the pattern /*.cpu.*/ • SHOW TAG KEYS FROM /.*heap.*/ shows different tag keys from measurements that match pattern • SHOW TAG VALUES FROM /.*cpu.*/ WITH KEY = jvmName shows different tag keys from measurements that match pattern Data exploration examples: • SELECT * FROM cpu WHERE host = ‘A’ selects series for “cpu” measurement with tag host=‘A’ • SELECT percentile(value, 95) FROM response_times WHERE time > now() - 1d GROUP BY time(1m) shows the 95th percentile of response times in the last day in 1 minute interval
  • 17. 17 FLAME GRAPHS D D C C C B B B B A A A A 0th ms 10th ms 20th ms 30th ms Gathered stack traces: A->B->C A->B->C->D A->B->C->D A->B D D C C C B B B B A A A A 0th ms 10th ms 20th ms 30th ms THE WIDTH OF A BAR MATTERS. Color doesn’t matter and is selected just to distinguish bars.
  • 18. 18 FLAME GRAPHS Flame graphs are a visualization of profiled software, allowing the most frequent code-paths to be identified quickly and accurately. Invented by Brendann Gregg: http://www.brendangregg.com
  • 19. 19 SEQUENCE OF ACTIONS Steps to Profile a Cluster: 1. Install InfluxDB on a separate machine visible to all machines of the cluster. Create a database and a user. 2. Get the agent’s jar file from my blog (or from sources) and put it into /var/lib at every worker node. 3. Change the configuration of the cluster: make _JAVA_OPTIONS=‘- javaagent…’ available to all JVM processes. 4. Run your application and get the stacktraces in the InfluxDB. You may “switch off” the _JAVA_AGENT after this. 5. Get the SVG files (flame graphs) from InfluxDB with the help of influxdb_dump.py and flamegraph_files.sh and do the analysis. These steps are described in detail at my blog http://ihorbobak.com
  • 21. 21 USE CASE WITH A REAL CUSTOMER The App/Inventory/Environment: •Our customer has an app that crawls data from a set of sites, parses it and puts to a Hadoop cluster (20 machines with 8 cores, 32GB RAM and 1TB HDD each). •The app leverages Apache Nutch, Cloudera Hadoop distribution version 5.3, Hbase, MongoDB and other technologies. •There is a central Java web app (Java/Tomcat) that uses Nutch which runs the mapreduce jobs. The problem: •The cluster crawls just 100 sites per day; a customer is asking us “how to make it crawl 10 times more on the same hardware?”
  • 22. 22 FIRST FINDINGS The first question that arose was: what exactly works slowly? At the beginning I quickly found this: the slow parts are the ones that are I/O intensive.
  • 23. 23 DISK I/O Then I ran I/O monitoring and a series of disk speed tests on the nodes. These are the results of the IOPS benchmark https://github.com/cxcv/iops/blob/master/iops:

Cluster node:
512 B blocks: 80.9 IO/s, 40.4 KiB/s (331.3 kbit/s)
1 KiB blocks: 97.9 IO/s, 97.9 KiB/s (802.1 kbit/s)
2 KiB blocks: 83.8 IO/s, 167.5 KiB/s ( 1.4 Mbit/s)
4 KiB blocks: 72.3 IO/s, 289.2 KiB/s ( 2.4 Mbit/s)
8 KiB blocks: 69.8 IO/s, 558.7 KiB/s ( 4.6 Mbit/s)
16 KiB blocks: 69.4 IO/s, 1.1 MiB/s ( 9.1 Mbit/s)
32 KiB blocks: 58.2 IO/s, 1.8 MiB/s ( 15.3 Mbit/s)
64 KiB blocks: 54.3 IO/s, 3.4 MiB/s ( 28.5 Mbit/s)
128 KiB blocks: 45.9 IO/s, 5.7 MiB/s ( 48.1 Mbit/s)
256 KiB blocks: 38.7 IO/s, 9.7 MiB/s ( 81.1 Mbit/s)
512 KiB blocks: 29.0 IO/s, 14.5 MiB/s (121.8 Mbit/s)
1 MiB blocks: 18.3 IO/s, 18.3 MiB/s (153.2 Mbit/s)
2 MiB blocks: 10.3 IO/s, 20.7 MiB/s (173.6 Mbit/s)
4 MiB blocks: 5.7 IO/s, 22.8 MiB/s (191.7 Mbit/s)
8 MiB blocks: 4.8 IO/s, 38.8 MiB/s (325.2 Mbit/s)
16 MiB blocks: 2.0 IO/s, 32.6 MiB/s (273.8 Mbit/s)
32 MiB blocks: 0.8 IO/s, 27.0 MiB/s (226.1 Mbit/s)

My local VM:
512 B blocks: 861.1 IO/s, 430.5 KiB/s ( 3.5 Mbit/s)
1 KiB blocks: 1084.7 IO/s, 1.1 MiB/s ( 8.9 Mbit/s)
2 KiB blocks: 836.6 IO/s, 1.6 MiB/s ( 13.7 Mbit/s)
4 KiB blocks: 698.4 IO/s, 2.7 MiB/s ( 22.9 Mbit/s)
8 KiB blocks: 755.7 IO/s, 5.9 MiB/s ( 49.5 Mbit/s)
16 KiB blocks: 909.1 IO/s, 14.2 MiB/s (119.2 Mbit/s)
32 KiB blocks: 784.9 IO/s, 24.5 MiB/s (205.7 Mbit/s)
64 KiB blocks: 747.9 IO/s, 46.7 MiB/s (392.1 Mbit/s)
128 KiB blocks: 593.2 IO/s, 74.2 MiB/s (622.0 Mbit/s)
256 KiB blocks: 441.4 IO/s, 110.4 MiB/s (925.8 Mbit/s)
512 KiB blocks: 423.3 IO/s, 211.6 MiB/s ( 1.8 Gbit/s)
1 MiB blocks: 295.1 IO/s, 295.1 MiB/s ( 2.5 Gbit/s)
2 MiB blocks: 159.1 IO/s, 318.3 MiB/s ( 2.7 Gbit/s)
4 MiB blocks: 103.2 IO/s, 412.6 MiB/s ( 3.5 Gbit/s)
8 MiB blocks: 46.6 IO/s, 372.8 MiB/s ( 3.1 Gbit/s)
16 MiB blocks: 23.4 IO/s, 374.0 MiB/s ( 3.1 Gbit/s)
32 MiB blocks: 11.9 IO/s, 381.9 MiB/s ( 3.2 Gbit/s)

The cluster node is 10 times slower than a VM running on my development workstation (host: Core i7/32GB/1TB; guest: a 3-core VM with 16GB RAM).
  • 24. 24 FETCHER MAPREDUCE JOB % of CPU time: 15% - HTML parsing 15% - Hadoop framework initialization code 7% - HDFS initialization code 22% - reducer code (BAD NEWS HERE) 18% - reading Hadoop XML config files 23% - real job
  • 25. 25 DRILL DOWN INTO THE REDUCER Inside Fetcher Reducer.run(): • org.apache.hadoop.hbase.catalog.MetaReader.fullScan() • org.apache.hadoop.hbase.client.HConnectionManager.createConnection() • creating a record writer • parsing the avro schema: org.apache.avro.Schema$Parser.parse(), ending with ZipFile.read(), ZipFile.getEntry(), etc.
  • 26. 26 DRILL DOWN INTO THE RECORD WRITER This is Gora library code. The most visible function calls on top are: java.util.zip.*, FileInputStream*, FileOutputStream*
  • 27. 27 REPEATING SLOW PARTS IN ALL JOBS
  • 28. 28 INEFFECTIVE MEMORY MANAGEMENT Most of the Java processes used significantly less memory than they were initially assigned. Legend: • init - the initial amount of memory that the JVM requests from the OS during startup; • used - the amount of memory currently used; • committed - the amount of memory that is guaranteed to be available for use by the Java virtual machine; • max - the maximum amount of memory (in bytes) that can be used for memory management. A memory allocation may fail if it attempts to increase the used memory such that used > committed, even if used <= max would still hold.
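The legend above matches what java.lang.management exposes per JVM; a minimal self-contained sketch that prints these four figures for the current process (the same numbers the profiler's memory metrics are built from):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapStats {
    public static void main(String[] args) {
        // MemoryMXBean reports heap usage as an init/used/committed/max tuple.
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        System.out.println("init      = " + heap.getInit());
        System.out.println("used      = " + heap.getUsed());
        System.out.println("committed = " + heap.getCommitted());
        System.out.println("max       = " + heap.getMax());  // -1 if undefined
    }
}
```

Comparing used against committed across the cluster's JVMs is what reveals the over-allocation described on this slide: processes that never grow their used heap anywhere near the memory they were assigned.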
  • 29. 29 PROBLEMS AND NEXT STEPS 1) Gora + HBase Reasons: bad code in Gora (too many full scans of the metadata table) Actions: • check Gora’s configuration and dive into the code to find out why it does full scans • try Cassandra instead of HBase 2) Hadoop framework parts, in particular: • HDFS initialization in mapreduce jobs (slow communication with the Namenode) • reading configuration files (done with the Xerces library) Possible reasons: • bad I/O speed and bad network speed • there may be some tuning of the XML parsing of config files that we’re not aware of Actions: • fix the hardware issues • investigate why Hadoop XML config parsing may be so slow • check Namenode memory usage
  • 30. 30 ANOTHER METHOD OF GETTING STACK TRACES Another method to get stack traces is Linux’s perf_events: perf record -F 99 -g -p PID perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5 Perf monitors: • hardware events (e.g. level 2 cache misses); • software events (e.g. CPU migrations); • tracepoint events (e.g. filesystem I/O, TCP events). Perf can also do: • sampling: collection of snapshots at some frequency (by timer); • dynamic tracing: instrumenting code to create events at any location (using the kprobes or uprobes frameworks). For more details see: http://www.brendangregg.com/perf.html
  • 31. 31 PERF vs. JAVA AGENT Advantages of perf over the java agent: • low overhead when getting stack traces; • combining user calls (Java) and kernel calls in one flame graph; • it catches 100% of Java methods, even hot methods from which the JVM may elide safepoint checks (http://chriskirk.blogspot.com/2013/09/what-is-java-safepoint.html is a good explanation of safepoints). Disadvantages of perf: • it cannot walk Java stack traces out of the box (frame-pointer-based stack walking had to be fixed in OpenJDK – done by Netflix and Twitter); • it doesn’t see Java symbols (hex numbers instead; a special agent is needed to add the symbols: https://github.com/jrudolph/perf-map-agent); • permissions must be configured for the symbol files; • it is necessary to develop a service which will launch perf, get the stack traces and pass them to a server.
  • 32. 32 PERF vs. JAVA AGENT And… it happens that Netflix’s product is open sourced…
  • 33. 33 CREDITS Andrew Johnson Software Engineer at Etsy Previously: Explorys, Inc. https://www.linkedin.com/in/ajsquared Brendan Gregg Senior Performance Architect at Netflix Previously: Joyent, Oracle, Sun Microsystems http://www.brendangregg.com/index.html
  • 34. 34 BLOGS/ARTICLES Blogs: • My blog article http://ihorbobak.com/index.php/2015/08/05/cluster-profiling/ • Etsy’s blog about the StatsD JVM Profiler https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/ https://codeascraft.com/2015/05/12/four-months-of-statsd-jvm-profiler-a-retrospective/ • Brendan Gregg’s blog http://www.brendangregg.com/blog/index.html Source code: • My modification of StatsD JVM Profiler https://github.com/ibobak/statsd-jvm-profiler • Original Etsy StatsD JVM Profiler https://github.com/etsy/statsd-jvm-profiler • Brendan Gregg’s FlameGraph https://github.com/brendangregg/FlameGraph Manuals: • InfluxDB Docs https://influxdb.com/docs/v0.9/introduction/overview.html • Overview of the JMX Technology https://docs.oracle.com/javase/tutorial/jmx/overview/index.html • JVM Tool Interface http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#starting
  • 35. 35 BOOKS / VIDEOS • Systems Performance: Enterprise and the Cloud by Brendan Gregg http://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098 • Blazing Performance with Flame Graphs by Brendan Gregg https://www.youtube.com/watch?v=nZfNehCzGdw • Linux profiling at Netflix by Brendan Gregg https://www.youtube.com/watch?v=_Ik8oiQvWgo • Profiling Java in Production by Kaushik Srenevasan, Twitter University https://www.youtube.com/watch?v=Yg6_ulhwLw0
