HADOOP CLUSTER
PERFORMANCE
PROFILING
Ihor Bobak
Lead Software Engineer, EPAM Systems
AUGUST 27, 2015
CONTENTS
Covered topics:
• What is profiling? How do profilers work?
• What problems can affect performance?
• How to profile a distributed application?
• Gathering, storing and analysis of stack traces
• Memory analysis
• Use Case
• Alternative approaches to profiling
WHAT IS A PROFILER?
A profiler is a tool for finding which parts of your app work slowly.
VisualVM YourKit
HOW DO PROFILERS WORK?
• Instrumenting: adding extra bytecode to your
methods for recording when they’re called and
how long they execute.
• Sampling: taking dumps of all the threads
periodically in order to understand how much
CPU time each method takes.
DIFFICULTIES WITH A CLUSTER
This is a typical mapreduce application running on a Hadoop cluster.
All blue boxes are separate JVM processes running on different
machines. Question: how can we profile a distributed Java app?
1. How to attach to a process
running on another host?
2. How to track the appearance
of new processes?
3. How to gather profiling data?
4. How to analyze this vast
amount of data?
WHY DO WE NEED A CLUSTER PROFILER?
Answer: we need a profiler to get more performance.
The Hadoop principle is:
“If you want more performance, add more hardware”.
This is true. But it is not the only truth.
Another truth is: there are problems common to
ALL applications (both distributed and local).
PROBLEM 1: SUBOPTIMAL CODE
public static void QuickSort(int[] a, int x, int y) {
    int pivot = (x + y) / 2;
    int apivot = a[pivot];
    int i = x;
    int j = y;
    while (i <= j) {
        while (a[i] < apivot) i++;
        while (a[j] > apivot) j--;
        if (i <= j) {
            int temp = a[i];
            a[i] = a[j];
            a[j] = temp;
            i++;
            j--;
        }
    }
    if (x < j)
        QuickSort(a, x, j);
    if (i < y)
        QuickSort(a, i, y);
}

public static void StupidSort(int[] a) {
    for (int i = 0; i < a.length - 1; ++i)
        for (int j = i + 1; j < a.length; ++j)
            if (a[i] > a[j]) {
                int temp = a[i];
                a[i] = a[j];
                a[j] = temp;
            }
}
Brute-force (“tupo-v-lob”) sort: O(N^2); Quicksort: O(N*log(N))
This is a simple example of two different algorithms solving the same task.
PROBLEM 2: BAD CODE/DATA
• Repeatedly doing the same unnecessary actions
Example: re-reading the configuration file or a database table again and again during every operation (although we
could cache it in the memory).
• Wrong usage of someone’s code/libraries/binaries
Example: Sqoop can import from MySQL in two modes – direct mode (using mysqldump and mysqlimport) and JDBC
mode. The first one is faster.
• Usage of wrong libraries
Example: https://powercollections.codeplex.com/workitem/16950
I found that famous Wintellect’s OrderedSet works 3 times slower than native Microsoft’s SortedSet.
• Absence of indexes in a database
Example: “select * from fact join dim on fact.productid = dim.productid” is slow because the developers forgot to
create keys/indexes
• Bugs in famous libraries/frameworks
Example: http://ihorbobak.com/index.php/2015/06/03/spark-sql-bad-performance/ problem with A->B->C tables
join when enumerated in order A, C, B. This is handled fine by all database servers, but NOT by Spark SQL.
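The fix for the first bullet, caching a configuration in memory instead of re-reading it on every operation, can be sketched as follows (the class and key names here are hypothetical, just for illustration):

```java
import java.util.Properties;

// Hypothetical example: load the configuration once and reuse it,
// instead of re-reading the file on every operation.
public class CachedConfig {
    private static volatile Properties cached;

    public static Properties get() {
        Properties local = cached;
        if (local == null) {
            synchronized (CachedConfig.class) {
                local = cached;
                if (local == null) {          // double-checked locking
                    local = loadFromDisk();   // expensive: done only once
                    cached = local;
                }
            }
        }
        return local;
    }

    private static Properties loadFromDisk() {
        Properties p = new Properties();
        p.setProperty("example.key", "example.value"); // stand-in for real file I/O
        return p;
    }
}
```

Every call after the first returns the same in-memory instance, so the expensive read happens exactly once per JVM.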
PROBLEM 3: HARDWARE TROUBLES
The two most important problems are:
• Disk problems (slow I/O speed)
• Network problems (slow bandwidth, packet
loss)
CLUSTER PROFILER ARCHITECTURE
Pipeline (shown as a diagram on the slide):
Java process → “injected” code which does stacktrace sampling → stacktraces
passed every 10 seconds through HTTP → a set of Python/Perl scripts that
produce visualizations → visualization in the form of flame graphs.
This is applicable to any Java process: mapper, reducer, etc., and it is
applicable not only to Hadoop: it can be Spark RDD code, Java web app
code, etc.
HOW THE JAVA AGENT WORKS
• The agent is bound to a Java process by specifying the -javaagent parameter, e.g.
java -javaagent:/path/agent.jar=parameters MainClass
or by overriding _JAVA_OPTIONS like this:
_JAVA_OPTIONS='-javaagent:/path/agent.jar=parameters'
• Agent’s jar has a manifest with
PreMain-Class: namespace.TheAgentClass
• “TheAgentClass” has a premain() method that executes before your
main() and does the following:
– Reads the parameters of the agent
– Constructs the profiler instances (based on parameters)
– Creates a ScheduledExecutorService (see java.util.concurrent) that does
scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS)
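A minimal skeleton of such an agent (the class, package, and parameter names here are illustrative, not the actual statsd-jvm-profiler source) could look like:

```java
import java.lang.instrument.Instrumentation;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TheAgentClass {
    // Parses "key1=v1,key2=v2" agent arguments into a map.
    public static Map<String, String> parseArgs(String args) {
        Map<String, String> params = new HashMap<>();
        if (args != null && !args.isEmpty()) {
            for (String pair : args.split(",")) {
                String[] kv = pair.split("=", 2);
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        return params;
    }

    // Runs before main() because the jar manifest declares
    // PreMain-Class: namespace.TheAgentClass
    public static void premain(String args, Instrumentation inst) {
        Map<String, String> params = parseArgs(args); // 1. read the parameters

        Runnable worker = () -> {
            // 2. the profiler constructed from params would sample and report here
        };

        // 3. report every 10 seconds; a daemon thread never keeps the target JVM alive
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "profiler-reporter");
                t.setDaemon(true);
                return t;
            });
        scheduler.scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS);
    }
}
```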
HOW THE JAVA AGENT WORKS
The profiler thread collects stacktraces 100 times per second using ThreadMXBean (a
part of JMX – a technology for monitoring and managing the JVM)
public void profile() {
    profileCount++;
    try {
        for (ThreadInfo thread : getAllRunnableThreads()) {
            if (thread.getStackTrace().length > 0) {
                String traceKey = StackTraceFormatter.formatStackTrace(thread.getStackTrace());
                if (filter.includeStackTrace(traceKey))
                    traces.increment(traceKey, 1);
            }
        }
    }
    catch (OutOfMemoryError ex) {
        // ... skipping code for handling OOM (just for safety)
    }
    if (profileCount == reportingFrequency) {
        profileCount = 0;
        recordMethodCounts();
    }
}
For more information about JMX read here:
https://docs.oracle.com/javase/tutorial/jmx/index.html
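The getAllRunnableThreads() helper is not shown in the snippet; a plausible sketch using ThreadMXBean (the real profiler's implementation may differ) is:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadSampler {
    private static final ThreadMXBean THREADS =
        ManagementFactory.getThreadMXBean();

    // Dump every live thread (with full stack depth) and keep only the
    // RUNNABLE ones -- blocked and waiting threads consume no CPU.
    public static List<ThreadInfo> getAllRunnableThreads() {
        List<ThreadInfo> result = new ArrayList<>();
        for (ThreadInfo info : THREADS.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                result.add(info);
            }
        }
        return result;
    }
}
```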
STATSD + MY CHANGES
I modified the well-known StatsD JVM profiler https://github.com/etsy/statsd-
jvm-profiler
List of my changes:
• Added the jvmName and host tags to each stacktrace;
• Optimized performance of the stacktrace collection code;
• Improved stability – added catching of OutOfMemoryError;
• Added statistics showing how many lines and characters we pass to the backend;
• Seriously modified influxdb_dump.py: now it extracts data into a set of distinct
files – one file for each JVM, each host, and a total;
• Added extraction of memory information and rendering it as charts in R;
• Added call_tree.py: a script for analysis of the method call trees;
• Added some helper scripts.
INFLUXDB
What is InfluxDB?
It is a time series, metrics, and analytics database.
Targeted at:
gathering metrics (like response times, CPU load), sensor
data, events (like exceptions) and real-time analytics.
Key Features:
• SQL-like query language;
• HTTP(S) API for data ingestion and queries;
• Built-in support for other data protocols such as
collectd;
• Has a CLI and web interface;
• Tag data for fast and efficient queries.
Annotations on an InfluxDB screenshot:
• Measurements (analog of tables)
• Tag keys:values
• Timestamps
• SQL-like query language
• Series: measurement name + tag key-values + data values
Schema exploration examples:
• SHOW MEASUREMENTS
shows the list of measurements
• SHOW SERIES FROM /.*cpu.*/
shows the list of series for each measurement whose name matches the
pattern /.*cpu.*/
• SHOW TAG KEYS FROM /.*heap.*/
shows the tag keys of measurements that match the pattern
• SHOW TAG VALUES FROM /.*cpu.*/ WITH KEY = jvmName
shows the values of the jvmName tag for measurements that match the pattern
Data exploration examples:
• SELECT * FROM cpu WHERE host = 'A'
selects series for the "cpu" measurement with tag host='A'
• SELECT percentile(value, 95) FROM response_times
WHERE time > now() - 1d
GROUP BY time(1m)
shows the 95th percentile of response times in the last day in 1 minute
interval
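Ingestion goes through the same HTTP API using InfluxDB's line protocol (measurement,tags fields timestamp). A sketch of building such a point for the "cpu" measurement with the host and jvmName tags from the profiler (the LineProtocol helper class is hypothetical, not a library API); the resulting line would be POSTed to /write?db=<database>:

```java
// Build an InfluxDB line-protocol point: measurement,tags fields timestamp.
// The "i" suffix marks an integer field; the timestamp is in nanoseconds.
public class LineProtocol {
    public static String point(String measurement, String host,
                               String jvmName, long value, long nanos) {
        return measurement + ",host=" + host + ",jvmName=" + jvmName
             + " value=" + value + "i " + nanos;
    }
}
```

For example, point("cpu", "A", "mapper", 42, 1L) produces "cpu,host=A,jvmName=mapper value=42i 1".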
FLAME GRAPHS
Gathered stack traces:
A->B->C
A->B->C->D
A->B->C->D
A->B

Each sample is drawn as one column of frames (root frame A at the bottom),
samples taken at 0, 10, 20 and 30 ms:

  D D
C C C
B B B B
A A A A

THE WIDTH OF A BAR MATTERS.
Color doesn’t matter and is selected just to distinguish bars.
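Before rendering, flamegraph.pl consumes stacks in a "folded" format: one semicolon-joined root-first path per line, followed by a sample count. A sketch of that collapsing step (the FoldedStacks class name is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Collapse sampled stack traces into the folded format that
// flamegraph.pl consumes: "frame1;frame2;frame3 count" per line.
public class FoldedStacks {
    public static Map<String, Integer> fold(List<List<String>> traces) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> trace : traces) {
            String key = String.join(";", trace);  // root-first path
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

For the four traces on this slide, fold() yields A;B;C → 1, A;B;C;D → 2, A;B → 1; bar widths in the rendered flame graph are proportional to those counts.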
FLAME GRAPHS
Flame graphs are a visualization of profiled software, allowing the
most frequent code-paths to be identified quickly and accurately.
Invented by Brendan Gregg: http://www.brendangregg.com
SEQUENCE OF ACTIONS
Steps to Profile a Cluster:
1. Install InfluxDB on a separate machine visible to all machines of the cluster.
Create a database and a user.
2. Get the agent’s jar file from my blog (or from sources) and put it into
/var/lib at every worker node.
3. Change the configuration of the cluster: make _JAVA_OPTIONS='-javaagent…'
available to all JVM processes.
4. Run your application and get the stacktraces into InfluxDB. You may
“switch off” _JAVA_OPTIONS after this.
5. Get the SVG files (flame graphs) from InfluxDB with the help of
influxdb_dump.py and flamegraph_files.sh and do the analysis.
These steps are described in detail at my blog http://ihorbobak.com
LOCATION FOR _JAVA_OPTIONS
_JAVA_OPTIONS='-javaagent:/var/lib/statsd-jvm-profiler-0.8.3-SNAPSHOT.jar=server=serveraddress,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=value1.value2.valueN,tagMapping=tag1,tag2,tagN'
USE CASE WITH A REAL CUSTOMER
The App/Inventory/Environment:
• Our customer has an app that crawls data from a set of sites, parses it
and puts it into a Hadoop cluster (20 machines with 8 cores, 32GB RAM
and 1TB HDD each).
• The app leverages Apache Nutch, Cloudera Hadoop distribution
version 5.3, HBase, MongoDB and other technologies.
• There is a central Java web app (Java/Tomcat) that uses Nutch, which
runs the mapreduce jobs.
The problem:
• The cluster crawls just 100 sites per day; the customer is asking us
“how to make it crawl 10 times more on the same hardware?”
FIRST FINDINGS
The first question that arose in my head: what exactly works slowly?
I quickly found this: the slow parts are the ones that are I/O intensive.
DISK I/O
Then I ran I/O monitoring and a series of disk-speed tests on the nodes.
This is the result of the IOPS benchmark
https://github.com/cxcv/iops/blob/master/iops:

Cluster node:
512 B blocks: 80.9 IO/s, 40.4 KiB/s (331.3 kbit/s)
1 KiB blocks: 97.9 IO/s, 97.9 KiB/s (802.1 kbit/s)
2 KiB blocks: 83.8 IO/s, 167.5 KiB/s ( 1.4 Mbit/s)
4 KiB blocks: 72.3 IO/s, 289.2 KiB/s ( 2.4 Mbit/s)
8 KiB blocks: 69.8 IO/s, 558.7 KiB/s ( 4.6 Mbit/s)
16 KiB blocks: 69.4 IO/s, 1.1 MiB/s ( 9.1 Mbit/s)
32 KiB blocks: 58.2 IO/s, 1.8 MiB/s ( 15.3 Mbit/s)
64 KiB blocks: 54.3 IO/s, 3.4 MiB/s ( 28.5 Mbit/s)
128 KiB blocks: 45.9 IO/s, 5.7 MiB/s ( 48.1 Mbit/s)
256 KiB blocks: 38.7 IO/s, 9.7 MiB/s ( 81.1 Mbit/s)
512 KiB blocks: 29.0 IO/s, 14.5 MiB/s (121.8 Mbit/s)
1 MiB blocks: 18.3 IO/s, 18.3 MiB/s (153.2 Mbit/s)
2 MiB blocks: 10.3 IO/s, 20.7 MiB/s (173.6 Mbit/s)
4 MiB blocks: 5.7 IO/s, 22.8 MiB/s (191.7 Mbit/s)
8 MiB blocks: 4.8 IO/s, 38.8 MiB/s (325.2 Mbit/s)
16 MiB blocks: 2.0 IO/s, 32.6 MiB/s (273.8 Mbit/s)
32 MiB blocks: 0.8 IO/s, 27.0 MiB/s (226.1 Mbit/s)

My local VM:
512 B blocks: 861.1 IO/s, 430.5 KiB/s ( 3.5 Mbit/s)
1 KiB blocks: 1084.7 IO/s, 1.1 MiB/s ( 8.9 Mbit/s)
2 KiB blocks: 836.6 IO/s, 1.6 MiB/s ( 13.7 Mbit/s)
4 KiB blocks: 698.4 IO/s, 2.7 MiB/s ( 22.9 Mbit/s)
8 KiB blocks: 755.7 IO/s, 5.9 MiB/s ( 49.5 Mbit/s)
16 KiB blocks: 909.1 IO/s, 14.2 MiB/s (119.2 Mbit/s)
32 KiB blocks: 784.9 IO/s, 24.5 MiB/s (205.7 Mbit/s)
64 KiB blocks: 747.9 IO/s, 46.7 MiB/s (392.1 Mbit/s)
128 KiB blocks: 593.2 IO/s, 74.2 MiB/s (622.0 Mbit/s)
256 KiB blocks: 441.4 IO/s, 110.4 MiB/s (925.8 Mbit/s)
512 KiB blocks: 423.3 IO/s, 211.6 MiB/s ( 1.8 Gbit/s)
1 MiB blocks: 295.1 IO/s, 295.1 MiB/s ( 2.5 Gbit/s)
2 MiB blocks: 159.1 IO/s, 318.3 MiB/s ( 2.7 Gbit/s)
4 MiB blocks: 103.2 IO/s, 412.6 MiB/s ( 3.5 Gbit/s)
8 MiB blocks: 46.6 IO/s, 372.8 MiB/s ( 3.1 Gbit/s)
16 MiB blocks: 23.4 IO/s, 374.0 MiB/s ( 3.1 Gbit/s)
32 MiB blocks: 11.9 IO/s, 381.9 MiB/s ( 3.2 Gbit/s)

The cluster node is about 10 times slower than a VM running on my development
workstation (host: Core i7/32GB/1TB; guest: 3-core VM with 16GB RAM).
FETCHER MAPREDUCE JOB
% of CPU time:
• 15% – HTML parsing
• 15% – Hadoop framework initialization code
• 7% – HDFS initialization code
• 22% – reducer code (BAD NEWS HERE)
• 18% – reading Hadoop XML config files
• 23% – real job
DRILL DOWN INTO THE REDUCER
Annotations on the FetcherReducer.run() flame graph:
• org.apache.hadoop.hbase.catalog.MetaReader.fullScan()
• Parsing the avro schema: org.apache.avro.Schema$Parser.parse(),
ending with ZipFile.read(), ZipFile.getEntry(), etc.
• org.apache.hadoop.hbase.client.HConnectionManager.createConnection()
• Creating a record writer
DRILL DOWN INTO THE RECORD WRITER
This is Gora library code. The most observable function calls on top are:
• java.util.zip.*
• FileInputStream*
• FileOutputStream*
REPEATING SLOW PARTS IN ALL JOBS
INEFFICIENT MEMORY MANAGEMENT
Most Java processes used significantly less memory
than they were initially assigned.
Legend:
• init – the initial amount of memory that the JVM requests from the OS
during startup;
• used – the amount of memory currently used;
• committed – the amount of memory that is guaranteed to be available
for use by the Java virtual machine;
• max – the maximum amount of memory (in bytes) that can be used for
memory management.
A memory allocation may fail if it attempts to increase the used memory
such that used > committed, even if used <= max would still be true.
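The same init/used/committed/max numbers can be read from inside any JVM through JMX; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Read the same init/used/committed/max numbers the profiler records,
// for the current JVM's heap.
public class HeapStats {
    public static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage u = heap();
        System.out.println("init=" + u.getInit() + " used=" + u.getUsed()
            + " committed=" + u.getCommitted() + " max=" + u.getMax());
    }
}
```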
PROBLEMS AND NEXT STEPS
1) Gora + HBase
Reason: bad code in Gora (too many full table scans of metadata)
Actions:
• Check Gora’s configuration; dive into the code to find out why it does a full scan
• Try Cassandra instead of HBase
2) Hadoop Framework parts, in particular:
• HDFS initialization in mapreduce jobs (slow communication with the Namenode)
• Reading configuration files (done with the Xerces library).
Possible reasons:
• Bad I/O speed and bad network speed.
• There may be a way to parameterize the XML parsing of config files that we’re not aware of.
Actions:
• Fix the hardware issues.
• Find out why Hadoop XML config parsing may be so slow.
• Check namenode memory usage.
ANOTHER METHOD OF GETTING STACK TRACES
Another method to get stack traces is Linux’s perf_events:
perf record -F 99 -g -p PID
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
Perf monitors:
• Hardware events (e.g. level 2 cache
misses);
• Software events (e.g. CPU migrations)
• Tracepoint events (e.g. filesystem I/O,
TCP events)
Perf can also do
• Sampling: collection of snapshots at some
frequency (by timer)
• Dynamic tracing: instrumenting code to
create events in any location (using
kprobes or uprobes frameworks)
For more details see: http://www.brendangregg.com/perf.html
PERF vs. JAVA AGENT
Advantages of perf over the Java agent:
• Low overhead when getting stack traces;
• Combines user calls (Java) and kernel calls in one flame graph;
• Catches 100% of Java methods (even hot methods from which the JVM
may exclude safepoint checks)
(http://chriskirk.blogspot.com/2013/09/what-is-java-safepoint.html – a good
explanation of safepoints).
Disadvantages of perf:
• Cannot get Java stacktraces out of the box (frame pointer-based stack
walking had to be fixed in OpenJDK – done by Netflix and Twitter);
• Doesn’t see Java symbols (hex numbers instead; a special agent is needed to add
symbols: https://github.com/jrudolph/perf-map-agent );
• Permissions must be configured for the symbol files;
• It is necessary to develop a service which will launch perf, get the stacktraces and
pass them to a server.
PERF vs. JAVA AGENT
And… it turns out that Netflix’s product is open sourced…
CREDITS
Andrew Johnson
Software Engineer at Etsy
Previously: Explorys, Inc.
https://www.linkedin.com/in/ajsquared
Brendan Gregg
Senior Performance Architect at Netflix
Previously: Joyent, Oracle, Sun Microsystems
http://www.brendangregg.com/index.html
BLOGS/ARTICLES
Blogs:
• My blog article
http://ihorbobak.com/index.php/2015/08/05/cluster-profiling/
• Etsy’s blog about JVM Profiler
https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/
https://codeascraft.com/2015/05/12/four-months-of-statsd-jvm-profiler-a-retrospective/
• Brendan Gregg’s blog
http://www.brendangregg.com/blog/index.html
Source code:
• My modification of StatsD JVM Profiler
https://github.com/ibobak/statsd-jvm-profiler
• Original Etsy’s StatsD JVM Profiler
https://github.com/etsy/statsd-jvm-profiler
• Brendan Gregg’s FlameGraph
https://github.com/brendangregg/FlameGraph
Manuals:
• InfluxDB Docs
https://influxdb.com/docs/v0.9/introduction/overview.html
• Overview of the JMX Technology
https://docs.oracle.com/javase/tutorial/jmx/overview/index.html
• JVM Tool Interface
http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#starting
BOOKS / VIDEOS
• Systems Performance: Enterprise and the Cloud
by Brendan Gregg
http://www.amazon.com/Systems-Performance-
Enterprise-Brendan-Gregg/dp/0133390098
• Blazing Performance with Flame Graphs
by Brendan Gregg
https://www.youtube.com/watch?v=nZfNehCzGdw
• Linux profiling at Netflix
by Brendan Gregg
https://www.youtube.com/watch?v=_Ik8oiQvWgo
• Profiling Java in Production
by Kaushik Srenevasan, Twitter University
https://www.youtube.com/watch?v=Yg6_ulhwLw0
Contacts:
Ihor Bobak
E-mail: ibobak@gmail.com
Skype: ibobak
More Related Content

PDF
Core Java Programming Language (JSE) : Chapter X - I/O Fundamentals
PDF
Java Performance Tuning
PDF
JCConf 2020 - New Java Features Released in 2020
PPTX
Application Profiling for Memory and Performance
PDF
Why GC is eating all my CPU?
ODP
Java 5 6 Generics, Concurrency, Garbage Collection, Tuning
ODP
Java Garbage Collection, Monitoring, and Tuning
PPTX
Java memory model
Core Java Programming Language (JSE) : Chapter X - I/O Fundamentals
Java Performance Tuning
JCConf 2020 - New Java Features Released in 2020
Application Profiling for Memory and Performance
Why GC is eating all my CPU?
Java 5 6 Generics, Concurrency, Garbage Collection, Tuning
Java Garbage Collection, Monitoring, and Tuning
Java memory model

What's hot (20)

DOCX
Advance Java Programs skeleton
ODP
Java Concurrency, Memory Model, and Trends
PPTX
Java SE 8 - New Features
PPT
CS6270 Virtual Machines - Java Virtual Machine Architecture and APIs
PPTX
Java 8 Feature Preview
PPTX
Concurrency with java
PPTX
Pune-Cocoa: Blocks and GCD
PDF
Java lab-manual
PDF
Java Concurrency in Practice
PDF
New Features Of JDK 7
PDF
ITFT - Java Coding
PPT
Java Tut1
PDF
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
PPTX
Java Micro-Benchmarking
PPT
Java programing considering performance
PDF
02 basic java programming and operators
PDF
Java features. Java 8, 9, 10, 11
PPTX
The Java memory model made easy
PDF
Java 7 New Features
PDF
Identifying memory leaks in Android applications
Advance Java Programs skeleton
Java Concurrency, Memory Model, and Trends
Java SE 8 - New Features
CS6270 Virtual Machines - Java Virtual Machine Architecture and APIs
Java 8 Feature Preview
Concurrency with java
Pune-Cocoa: Blocks and GCD
Java lab-manual
Java Concurrency in Practice
New Features Of JDK 7
ITFT - Java Coding
Java Tut1
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Java Micro-Benchmarking
Java programing considering performance
02 basic java programming and operators
Java features. Java 8, 9, 10, 11
The Java memory model made easy
Java 7 New Features
Identifying memory leaks in Android applications
Ad

Similar to Hadoop cluster performance profiler (20)

PPTX
Profile hadoop apps
PDF
Next-Gen Business Transaction Configuration, Instrumentation, and Java Perfor...
PDF
Rigorous and Multi-tenant HBase Performance
PPTX
Rigorous and Multi-tenant HBase Performance Measurement
PDF
The Art Of Performance Tuning - with presenter notes!
PPTX
Performance eng prakash.sahu
PPTX
DIY Java Profiling
PDF
Shared slides-edbt-keynote-03-19-13
PDF
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
PDF
Java black box profiling JUG.EKB 2016
PDF
Performance Profiling Tools & Tricks
PDF
Performance Profiling Tools and Tricks
PPTX
Apache big-data-2017-spark-profiling
PDF
Tools and Tips to Diagnose Performance Issues
PDF
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
PDF
20140228 fp and_performance
PDF
Introduction to Java Profiling
PDF
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
PDF
Java Performance & Profiling
PDF
Odp - On demand profiler (ICPE 2018)
Profile hadoop apps
Next-Gen Business Transaction Configuration, Instrumentation, and Java Perfor...
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance Measurement
The Art Of Performance Tuning - with presenter notes!
Performance eng prakash.sahu
DIY Java Profiling
Shared slides-edbt-keynote-03-19-13
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Java black box profiling JUG.EKB 2016
Performance Profiling Tools & Tricks
Performance Profiling Tools and Tricks
Apache big-data-2017-spark-profiling
Tools and Tips to Diagnose Performance Issues
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
20140228 fp and_performance
Introduction to Java Profiling
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Java Performance & Profiling
Odp - On demand profiler (ICPE 2018)
Ad

Recently uploaded (20)

PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
MYSQL Presentation for SQL database connectivity
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Machine learning based COVID-19 study performance prediction
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Encapsulation theory and applications.pdf
PPTX
Cloud computing and distributed systems.
PDF
cuic standard and advanced reporting.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Reach Out and Touch Someone: Haptics and Empathic Computing
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Unlocking AI with Model Context Protocol (MCP)
Understanding_Digital_Forensics_Presentation.pptx
Encapsulation_ Review paper, used for researhc scholars
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Network Security Unit 5.pdf for BCA BBA.
Programs and apps: productivity, graphics, security and other tools
MYSQL Presentation for SQL database connectivity
The AUB Centre for AI in Media Proposal.docx
Machine learning based COVID-19 study performance prediction
Spectral efficient network and resource selection model in 5G networks
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Encapsulation theory and applications.pdf
Cloud computing and distributed systems.
cuic standard and advanced reporting.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
NewMind AI Weekly Chronicles - August'25 Week I
Build a system with the filesystem maintained by OSTree @ COSCUP 2025

Hadoop cluster performance profiler

  • 1. HADOOP CLUSTER PERFORMANCE PROFILING Ihor Bobak Lead Software Engineer, EPAM Systems AUGUST 27, 2015
  • 2. CONTENTS Covered topics: • What is profiling? How do profilers work? • What problems can affect performance? • How to profile a distributed application? • Gathering, storing and analysis of stack traces • Memory analysis • Use Case • Alternative approaches to profiling
  • 3. 3 WHAT IS A PROFILER? Profiler is a tool to look what parts of your app is working slowly. VisualVM YourKit
  • 4. 4 HOW DO PROFILERS WORK? • Instrumenting: adding extra bytecode to your methods for recording when they’re called and how long they execute. • Sampling: taking dumps of all the threads periodically in order to understand how much CPU time each method takes.
  • 5. 5 DIFFUCULTIES WITH A CLUSTER This is a typical mapreduce application running on a Hadoop cluster. All blue boxes are separate JVM processes running on different machines. Question: how can we profile a distributed Java app? 1. How to attach to a process running on another host? 2. How to track the appearance of new processes? 3. How to gather profiling data? 4. How to analyze this vast amount of data?
  • 6. 6 WHY DO WE NEED A CLUSTER PROFILER? Answer: we need a profiler to get more performance. Hadoop principle is next: “If you want more performance, add more hardware”. This is a truth. But this is not the only truth. Another truth is: there are problems that are related to ALL applications (both distributed and local).
  • 7. 7 PROBLEM 1: NOT OPTIMAL CODE public static void QuickSort(int[] a, int x, int y){ int pivot = (x+y)/2; int apivot = a[pivot]; int i = x; int j = y; while (i <= j){ while (a[i] < apivot) i++; while (a[j] > apivot) j--; if (i <= j){ int temp = a[i]; a[i] = a[j]; a[j] = temp; i++; j--; } } if (x < j) QuickSort(a, x, j); if (i < y) QuickSort(a, i, y); } public static void StupidSort(int[] a){ for (int i = 0; i < a.length - 1; ++i) for (int j = i + 1; j < a.length; ++j) if (a[i] > a[j]){ int temp = a[i]; a[i] = a[j]; a[j] = temp; } } “Tupo-v-lob” sort: O(N^2) Quicksort: O(N*log(N)) This is a simple example of different algorithm solving the same tasks:
  • 8. 8 PROBLEM 2: BAD CODE/DATA • Repeatedly doing the same unnecessary actions Example: re-reading the configuration file or a database table again and again during every operation (although we could cache it in the memory). • Wrong usage of someone’s code/libraries/binaries Example: sqoop can import from MySQL in two modes – direct mode (using mysqldump and mysqlimport) and JDBC- mode. The first one is faster. • Usage of wrong libraries Example: https://powercollections.codeplex.com/workitem/16950 I found that famous Wintellect’s OrderedSet works 3 times slower than native Microsoft’s SortedSet. • Absense of indexes in a database Example: “select * from fact join dim on fact.productid = dim.productid” is slow because developers missed to make keys/indexes • Bugs in famous libraries/frameworks Example: http://ihorbobak.com/index.php/2015/06/03/spark-sql-bad-performance/ problem with A->B->C tables join when enumerated in order A, C, B. This is handled fine by all database servers, but NOT by Spark SQL.
  • 9. 9 PROBLEM 3: HARDWARE TROUBLES Two most important problems are: • Disk problems (slow I/O speed) • Network problems (slow bandwidths, packets loss)
  • 10. 10 CLUSTER PROFILER ARCHITECTURE Java Process “Injected” code which does stacktrace sampling Passing stacktraces each 10 seconds though HTTP A set of Python/Perl scripts to get visualizations Visualization in the form of flame graphs This is applicable to any java process: mapper, reducer, etc., and it is applicable not only Hadoop: it can be Spark RDD code, Java web app code, etc.
  • 11. 11 HOW JAVA AGENT WORKS? • Agent is bound to a java process by specifying -javaagent parameter, e.g. java –javaagent:/path/agent.jar=parameters MainClass or by overriding _JAVA_OPTIONS like this: _JAVA_OPTIONS='-javaagent:/path/agent.jar=parameters • Agent’s jar has a manifest with PreMain-Class: namespace.TheAgentClass • “TheAgentClass” has a premain() method that executes before your main() and does the following: – Read the parameters of the agent – Constructs the profiler instances (based on parameters) – Creates a ScheduledExecutorService (see java.util.concurrent) that does scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS)
  • 12. 12 HOW JAVA AGENT WORKS? The profiler thread collects stacktraces 100 times per second using ThreadMXBean (a part of JMX – a technology for monitoring and managing the JVM) public void profile() { profileCount++; try{ for (ThreadInfo thread : getAllRunnableThreads()) { if (thread.getStackTrace().length > 0) { String traceKey = StackTraceFormatter.formatStackTrace(thread.getStackTrace()); if (filter.includeStackTrace(traceKey)) traces.increment(traceKey, 1); } } } catch (OutOfMemoryError ex) { // ... skipping code for handling OOM (just for safety) } if (profileCount == reportingFrequency) { profileCount = 0; recordMethodCounts(); } } For more information about JMX read here: https://docs.oracle.com/javase/tutorial/jmx/index.html
  • 13. 13 STATSD + MY CHANGES I made a modification of a famous StatsD JVM profiler https://github.com/etsy/statsd- jvm-profiler List of my changes: • Added the jvmName and host tag to each stacktrace; • Optimized performance in stacktraces collection code; • Improved stability - added catching of OutOfMemoryException; • Added statistics to show how many lines and characters we pass to the backend; • Seriously modified the influxdb_dump.py: now it extracts data into a set of distinct files - one file for each JVM, each host and a total. • Added extraction of memory information and rendering it with charts in R • Added call_tree.py: a script for analysis of the method call trees • Added some helper scripts.
  • 14. 14 INFLUXD What is InfluxDB? It is a time series, metrics, and analytics database. Targeted at: gathering metrics (like response times, CPU load), sensor data, events (like exceptions) and real-time analytics. Key Features: • SQL-like query language; • HTTP(S) API for data ingestion and queries; • Built-in support for other data protocols such as collectd; • Has a CLI and web interface; • Tag data for fast and efficient queries.
  • 15. 15 Measurements (analog of tables) tag keys:values SQL-like query language timestamps Series: measurement name + tag key-values + data values
  • 16. 16 Schema exploration examples: • SHOW MEASUREMENTS shows the list of measurements • SHOW SERIES FROM /.*cpu.*/ shows the list of series for each measurement whose name matches the pattern /*.cpu.*/ • SHOW TAG KEYS FROM /.*heap.*/ shows different tag keys from measurements that match pattern • SHOW TAG VALUES FROM /.*cpu.*/ WITH KEY = jvmName shows different tag keys from measurements that match pattern Data exploration examples: • SELECT * FROM cpu WHERE host = ‘A’ selects series for “cpu” measurement with tag host=‘A’ • SELECT percentile(value, 95) FROM response_times WHERE time > now() - 1d GROUP BY time(1m) shows the 95th percentile of response times in the last day in 1 minute interval
  • 17. 17 FLAME GRAPHS D D C C C B B B B A A A A 0th ms 10th ms 20th ms 30th ms Gathered stack traces: A->B->C A->B->C->D A->B->C->D A->B D D C C C B B B B A A A A 0th ms 10th ms 20th ms 30th ms THE WIDTH OF A BAR MATTERS. Color doesn’t matter and is selected just to distinguish bars.
  • 18. 18 FLAME GRAPHS Flame graphs are a visualization of profiled software, allowing the most frequent code-paths to be identified quickly and accurately. Invented by Brendann Gregg: http://www.brendangregg.com
  • 19. 19 SEQUENCE OF ACTIONS Steps to Profile a Cluster: 1. Install InfluxDB on a separate machine visible to all machines of the cluster. Create a database and a user. 2. Get the agent’s jar file from my blog (or from sources) and put it into /var/lib at every worker node. 3. Change the configuration of the cluster: make _JAVA_OPTIONS=‘- javaagent…’ available to all JVM processes. 4. Run your application and get the stacktraces in the InfluxDB. You may “switch off” the _JAVA_AGENT after this. 5. Get the SVG files (flame graphs) from InfluxDB with the help of influxdb_dump.py and flamegraph_files.sh and do the analysis. These steps are described in detail at my blog http://ihorbobak.com
  • 21. 21 USE CASE WITH A REAL CUSTOMER The App/Inventory/Environment: •Our customer has an app that crawls data from a set of sites, parses it and puts to a Hadoop cluster (20 machines with 8 cores, 32GB RAM and 1TB HDD each). •The app leverages Apache Nutch, Cloudera Hadoop distribution version 5.3, Hbase, MongoDB and other technologies. •There is a central Java web app (Java/Tomcat) that uses Nutch which runs the mapreduce jobs. The problem: •The cluster crawls just 100 sites per day; a customer is asking us “how to make it crawl 10 times more on the same hardware?”
  • 22. 22 FIRST FINDINGS The first question that arose was: what exactly works slowly? At the beginning I quickly found this: the slow parts are the ones that are I/O intensive.
  • 23. 23 DISK I/O Then I ran I/O monitoring and a series of disk speed tests on the nodes. These are the results of the IOPS benchmark https://github.com/cxcv/iops/blob/master/iops:

Cluster node:
512 B blocks: 80.9 IO/s, 40.4 KiB/s (331.3 kbit/s)
1 KiB blocks: 97.9 IO/s, 97.9 KiB/s (802.1 kbit/s)
2 KiB blocks: 83.8 IO/s, 167.5 KiB/s ( 1.4 Mbit/s)
4 KiB blocks: 72.3 IO/s, 289.2 KiB/s ( 2.4 Mbit/s)
8 KiB blocks: 69.8 IO/s, 558.7 KiB/s ( 4.6 Mbit/s)
16 KiB blocks: 69.4 IO/s, 1.1 MiB/s ( 9.1 Mbit/s)
32 KiB blocks: 58.2 IO/s, 1.8 MiB/s ( 15.3 Mbit/s)
64 KiB blocks: 54.3 IO/s, 3.4 MiB/s ( 28.5 Mbit/s)
128 KiB blocks: 45.9 IO/s, 5.7 MiB/s ( 48.1 Mbit/s)
256 KiB blocks: 38.7 IO/s, 9.7 MiB/s ( 81.1 Mbit/s)
512 KiB blocks: 29.0 IO/s, 14.5 MiB/s (121.8 Mbit/s)
1 MiB blocks: 18.3 IO/s, 18.3 MiB/s (153.2 Mbit/s)
2 MiB blocks: 10.3 IO/s, 20.7 MiB/s (173.6 Mbit/s)
4 MiB blocks: 5.7 IO/s, 22.8 MiB/s (191.7 Mbit/s)
8 MiB blocks: 4.8 IO/s, 38.8 MiB/s (325.2 Mbit/s)
16 MiB blocks: 2.0 IO/s, 32.6 MiB/s (273.8 Mbit/s)
32 MiB blocks: 0.8 IO/s, 27.0 MiB/s (226.1 Mbit/s)

My local VM:
512 B blocks: 861.1 IO/s, 430.5 KiB/s ( 3.5 Mbit/s)
1 KiB blocks: 1084.7 IO/s, 1.1 MiB/s ( 8.9 Mbit/s)
2 KiB blocks: 836.6 IO/s, 1.6 MiB/s ( 13.7 Mbit/s)
4 KiB blocks: 698.4 IO/s, 2.7 MiB/s ( 22.9 Mbit/s)
8 KiB blocks: 755.7 IO/s, 5.9 MiB/s ( 49.5 Mbit/s)
16 KiB blocks: 909.1 IO/s, 14.2 MiB/s (119.2 Mbit/s)
32 KiB blocks: 784.9 IO/s, 24.5 MiB/s (205.7 Mbit/s)
64 KiB blocks: 747.9 IO/s, 46.7 MiB/s (392.1 Mbit/s)
128 KiB blocks: 593.2 IO/s, 74.2 MiB/s (622.0 Mbit/s)
256 KiB blocks: 441.4 IO/s, 110.4 MiB/s (925.8 Mbit/s)
512 KiB blocks: 423.3 IO/s, 211.6 MiB/s ( 1.8 Gbit/s)
1 MiB blocks: 295.1 IO/s, 295.1 MiB/s ( 2.5 Gbit/s)
2 MiB blocks: 159.1 IO/s, 318.3 MiB/s ( 2.7 Gbit/s)
4 MiB blocks: 103.2 IO/s, 412.6 MiB/s ( 3.5 Gbit/s)
8 MiB blocks: 46.6 IO/s, 372.8 MiB/s ( 3.1 Gbit/s)
16 MiB blocks: 23.4 IO/s, 374.0 MiB/s ( 3.1 Gbit/s)
32 MiB blocks: 11.9 IO/s, 381.9 MiB/s ( 3.2 Gbit/s)

The cluster node is 10 times slower than a VM running on my development workstation (host: Core i7/32GB/1TB; guest: a 3-core VM with 16GB RAM).
  • 24. 24 FETCHER MAPREDUCE JOB % of CPU time: 15% - HTML parsing 15% - Hadoop framework initialization code 7% - HDFS initialization code 22% - reducer code (BAD NEWS HERE) 18% - reading Hadoop XML config files 23% - real job
  • 25. 25 DRILL DOWN INTO THE REDUCER Inside Fetcher Reducer.run(): • org.apache.hadoop.hbase.catalog.MetaReader.fullScan() • org.apache.hadoop.hbase.client.HConnectionManager.createConnection() • creating a record writer • parsing the avro schema: org.apache.avro.Schema$Parser.parse(), ending with ZipFile.read(), ZipFile.getEntry(), etc.
  • 26. 26 DRILL DOWN INTO THE RECORD WRITER This is Gora library code. The most visible function calls on top are: java.util.zip.*, FileInputStream*, FileOutputStream*
  • 27. 27 REPEATING SLOW PARTS IN ALL JOBS
  • 28. 28 INEFFECTIVE MEMORY MANAGEMENT Most of the Java processes used significantly less memory than they were initially assigned. Legend: • init - the initial amount of memory that the JVM requests from the OS during startup; • used - the amount of memory currently used; • committed - the amount of memory that is guaranteed to be available for use by the Java virtual machine; • max - the maximum amount of memory (in bytes) that can be used for memory management. A memory allocation may fail if it attempts to increase the used memory such that used > committed, even if used <= max would still hold.
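The legend above matches what java.lang.management exposes per JVM; a minimal self-contained sketch that prints these four figures for the current process (the same numbers the profiler's memory metrics are built from):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapStats {
    public static void main(String[] args) {
        // MemoryMXBean reports heap usage as an init/used/committed/max tuple.
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        System.out.println("init      = " + heap.getInit());
        System.out.println("used      = " + heap.getUsed());
        System.out.println("committed = " + heap.getCommitted());
        System.out.println("max       = " + heap.getMax());  // -1 if undefined
    }
}
```

Comparing used against committed across the cluster's JVMs is what reveals the over-allocation described on this slide: processes that never grow their used heap anywhere near the memory they were assigned.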
  • 29. 29 PROBLEMS AND NEXT STEPS 1) Gora + HBase Reasons: bad code in Gora (too many full scans of the metadata table) Actions: • check Gora’s configuration and dive into the code to find out why it does full scans • try Cassandra instead of HBase 2) Hadoop framework parts, in particular: • HDFS initialization in mapreduce jobs (slow communication with the Namenode) • reading configuration files (done with the Xerces library) Possible reasons: • bad I/O speed and bad network speed • there may be some tuning of the XML parsing of config files that we’re not aware of Actions: • fix the hardware issues • investigate why Hadoop XML config parsing may be so slow • check Namenode memory usage
  • 30. 30 ANOTHER METHOD OF GETTING STACK TRACES Another method to get stack traces is Linux’s perf_events: perf record -F 99 -g -p PID perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5 Perf monitors: • hardware events (e.g. level 2 cache misses); • software events (e.g. CPU migrations); • tracepoint events (e.g. filesystem I/O, TCP events). Perf can also do: • sampling: collection of snapshots at some frequency (by timer); • dynamic tracing: instrumenting code to create events at any location (using the kprobes or uprobes frameworks). For more details see: http://www.brendangregg.com/perf.html
  • 31. 31 PERF vs. JAVA AGENT Advantages of perf over the java agent: • low overhead when getting stack traces; • combining user calls (Java) and kernel calls in one flame graph; • it catches 100% of Java methods, even hot methods from which the JVM may elide safepoint checks (http://chriskirk.blogspot.com/2013/09/what-is-java-safepoint.html is a good explanation of safepoints). Disadvantages of perf: • it cannot walk Java stack traces out of the box (frame-pointer-based stack walking had to be fixed in OpenJDK – done by Netflix and Twitter); • it doesn’t see Java symbols (hex numbers instead; a special agent is needed to add the symbols: https://github.com/jrudolph/perf-map-agent); • permissions must be configured for the symbol files; • it is necessary to develop a service which will launch perf, get the stack traces and pass them to a server.
  • 32. 32 PERF vs. JAVA AGENT And… it happens that Netflix’s product is open sourced…
  • 33. 33 CREDITS Andrew Johnson Software Engineer at Etsy Previously: Explorys, Inc. https://www.linkedin.com/in/ajsquared Brendan Gregg Senior Performance Architect at Netflix Previously: Joyent, Oracle, Sun Microsystems http://www.brendangregg.com/index.html
  • 34. 34 BLOGS/ARTICLES Blogs: • My blog article http://ihorbobak.com/index.php/2015/08/05/cluster-profiling/ • Etsy’s blog about the StatsD JVM Profiler https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/ https://codeascraft.com/2015/05/12/four-months-of-statsd-jvm-profiler-a-retrospective/ • Brendan Gregg’s blog http://www.brendangregg.com/blog/index.html Source code: • My modification of StatsD JVM Profiler https://github.com/ibobak/statsd-jvm-profiler • Original Etsy StatsD JVM Profiler https://github.com/etsy/statsd-jvm-profiler • Brendan Gregg’s FlameGraph https://github.com/brendangregg/FlameGraph Manuals: • InfluxDB Docs https://influxdb.com/docs/v0.9/introduction/overview.html • Overview of the JMX Technology https://docs.oracle.com/javase/tutorial/jmx/overview/index.html • JVM Tool Interface http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#starting
  • 35. 35 BOOKS / VIDEOS • Systems Performance: Enterprise and the Cloud by Brendan Gregg http://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098 • Blazing Performance with Flame Graphs by Brendan Gregg https://www.youtube.com/watch?v=nZfNehCzGdw • Linux profiling at Netflix by Brendan Gregg https://www.youtube.com/watch?v=_Ik8oiQvWgo • Profiling Java in Production by Kaushik Srenevasan, Twitter University https://www.youtube.com/watch?v=Yg6_ulhwLw0
