Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing Hadoop Using Hadoop
15 Apr 2015
Sheetal Dolas
Principal Architect, Hortonworks
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Who am I ?
• Principal Architect @ Hortonworks
• Most of my career has been in the field, solving real-life business
problems
• Last 5+ years in Big Data, including Hadoop, Storm, etc.
• Co-developed Cisco OpenSOC ( http://opensoc.github.io )
sheetal@hortonworks.com
@sheetal_dolas
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Agenda
• Need for operational insights
• Challenges
• Data sets available
• Using Hadoop to analyze itself
• Sample reports
• Q and A
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Need for Operational Insights
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Need for Metrics Analysis
• Metrics can reveal the story about your cluster
• They help you understand workload characteristics
o Reveal the pain points
o Clear up misconceptions
o Drive toward an action plan
• Operational insights are critical for SLA management by
improving
o System Reliability
o Uptime
o Performance
o Security
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Metrics Challenges
• Hadoop generates a lot of metrics
o Host metrics (CPU, Memory, Disk, Network)
o Service metrics (JVM metrics, GC, Transactions, Performance)
o Service reports (fsck, lsr, dfs admin, audit logs)
o Job Metrics (Resource utilization, data processed, performance)
• Understanding and analyzing them is overwhelming
• No existing tools adequately address the whole spectrum
• Deeper technology understanding is needed
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Metrics can appear like this conversation
hmm…
hmmmm…
hah!
ahem!
ahh!
eh?
Hadoop Expert Hadoop Newbie
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Metrics can appear like this conversation
You know all the words and their meaning
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Metrics can appear like this conversation
But you still don’t get the meaning of the conversation
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
We need tools that help extract meaning out of it
Hadoop Expert Hadoop Newbie
hmm…
Hadoop has Magnificent
Metrics
hmmmm…
Hadoop Metrics Make Me Mad
hah!
Hadoop Analyzes Hadoop
ahem!
Analyze Hadoop Easily in
Minutes
ahh!
Awesome! Hail Hadoop!
eh?
Elucidative Hadoop?
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Datasets available
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Datasets available for analysis
• MapReduce job history log
• HDFS lsr report
• HDFS Audit log
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
MapReduce Job History Log
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Job history log
• Stored on HDFS
• Contains all the events that occurred in a job, plus the event metadata
• Has its own format
o Can be parsed using the Rumen API (see the sketch below)
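One entry point to Rumen is its TraceBuilder tool, which converts job history files into a JSON job trace plus a cluster topology file. A minimal sketch, assuming the default done-directory location (all paths below are illustrative, and depending on the distribution the Rumen classes may need to be added to the classpath):

hadoop org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json \
  file:///tmp/job-topology.json \
  hdfs:///mr-history/done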
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing MapReduce Job history log
Hadoop cluster components: HDFS, YARN, Tez, Hive
Analysis interfaces: JDBC clients, ODBC clients, Hive CLI
Pipeline:
1. Periodically read the job history logs from HDFS
2. Parse the logs with Rumen, compute job resource metrics, and write the results back to Hive (a sketch of a possible landing table follows)
3. Query the data through a preferred interface
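As a concrete landing target for step 2, the parsed per-job metrics could be written to a Hive table along the following lines. This is only a sketch; the deck does not show its actual schema, so the table name and columns here are assumptions:

CREATE TABLE IF NOT EXISTS job_history_summary (
  job_id          STRING,
  job_name        STRING,
  user_name       STRING,
  queue_name      STRING,
  submit_date     STRING,   -- 'yyyy-MM-dd', derived from the submit timestamp
  submit_time     BIGINT,   -- epoch milliseconds
  launch_time     BIGINT,
  finish_time     BIGINT,
  status          STRING,   -- SUCCEEDED / FAILED / KILLED
  total_maps      INT,
  total_reduces   INT,
  cpu_ms          BIGINT,   -- aggregate task CPU milliseconds
  memory_gb_hours DOUBLE,   -- container GB held, multiplied by hours held
  hdfs_read_gb    DOUBLE,
  hdfs_write_gb   DOUBLE
)
STORED AS ORC;

The sample reports that follow can then be expressed as plain HiveQL over such a table.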
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sample Reports
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
CPU Utilization
[Pie chart] CPU Utilization - By Queue - Week To Date: productintelligence 33%, cfld 28%, adhoc 25%, hive 8%, techsupport 3%, mnm 3%, webhcat 0%, infosecurity 0%, prodintel_small 0%
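A report like this could come from a query of the following shape against the illustrative job_history_summary table sketched earlier (the table and its columns are assumptions, not shown in the deck); the percentage shares follow from dividing each queue's hours by the overall total:

SELECT queue_name,
       ROUND(SUM(cpu_ms) / 3600000.0, 2) AS cpu_hours   -- milliseconds to hours
FROM   job_history_summary
WHERE  submit_date >= '2015-04-13'                      -- start of the current week, illustrative
GROUP  BY queue_name
ORDER  BY cpu_hours DESC;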
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Disk IO
[Pie chart] Data IO (GB) - By User - Yesterday. Labeled shares: 53%, 33%, 7%, 7%, 0%. User IDs: katharine.matsumoto, hadoop_sa, ebrown, mzang, justin.meyer, jmarquez, nbhupalam, rchakravarthy, pyan, rchirala, thomas.cox
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Workload Distribution Through Hour of Day
[Three charts, x-axis: job submission hour (0-22)] Number of jobs submitted (0-350), number of tasks submitted (0-900,000), and total data processed in GB (0-500,000).
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Workload Distribution Through Day of Week
[Three charts, x-axis: job submission day of week] Number of jobs submitted (0-300), number of tasks submitted (0-1,800,000), and total data processed in GB (0-700,000).
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Job Type and Status
Job Distribution By Type - Yesterday: Hive 73, MapReduce 143, Pig 199
Job Distribution By Status - Yesterday: SUCCEEDED 98%, FAILED 2%, KILLED 0%
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Top 5 long running jobs - Yesterday
Job Id Job Name User Name Queue Name Job Duration
job_1409197939494_7043 PigLatin:mbl_chtr_ios_blackberry_metrics.pig joy_d cfld 1 d 16 h 33 m 15 s
job_1409197939494_7629 PigLatin:LTF:09:12:Job3 john_s infosecurity 1 d 8 h 40 m 42 s
job_1409197939494_7243 PigLatin:mbl_chtr_ios_blackberry_metrics.pig joy_d cfld 1 d 6 h 54 m 56 s
job_1409197939494_7042 PigLatin:mbl_chtr_android_metrics.pig hadoop_sa hive 1 d 3 h 37 m 30 s
job_1409197939494_7328 INSERT INTO TABLE com...ILE__NAME,'.')[5])(Stage-1) hadoop_sa hive 1 d 1 h 28 m 35 s
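A sketch of how such a report could be produced from the illustrative job_history_summary table introduced earlier (column names are assumptions):

SELECT job_id, job_name, user_name, queue_name,
       (finish_time - launch_time) / 1000 AS duration_seconds
FROM   job_history_summary
WHERE  submit_date = '2015-04-14'          -- "yesterday", illustrative
ORDER  BY duration_seconds DESC
LIMIT  5;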
Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Top 5 long waiting jobs - Yesterday
Job Id Job Name User Name Queue Name Job Submission Wait
job_1409197939494_7621 ODS_S.ODS_LOG_ORG_TYP_METRICS.jar joy_d cfld 5 h 39 m 38 s
job_1409197939494_8222 PigLatin:LTF:09:15:Job3 john_s infosecurity 5 h 19 m 46 s
job_1409197939494_8357 PigLatin:LTF:09:19:Job9 raj_s mnm 5 h 18 m 47 s
job_1409197939494_7622 PigLatin:Log_U_Org_Metrics.pig katherine_d cfld 5 h 11 m 12 s
job_1409197939494_8071 PigLatin:LTF:09:16:Job10 raj_s mnm 5 h 4 m
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Top 5 resource consuming jobs
Job Id Total Maps Total Reduces Requested Map GB Requested Reduce GB Total Memory Blocked By Job (GB)
job_1403277400645_1400 27,358 6 4 4 109,456
job_1403277400645_1423 27,358 3 4 4 109,444
job_1403277400645_1745 5,581 1 4 4 22,328
job_1403277400645_1497 1,807 0 4 4 7,228
job_1403277400645_1564 1,794 0 4 4 7,176
Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Showback Reports
Queue Name | Total CPU Hours Used | CPU Cost | Total Memory GB Hours Blocked | Memory Cost | Total Data IO GB | Data IO Cost | Total Network IO GB | Network IO Cost | Total Cost
adhoc 4,422.94 17.69 20,404.09 81.62 70,918.29 1,418.37 394.01 7.88 $1,525.56
cfld 41,038.93 164.16 150,130.90 600.52 446,762.29 8,935.25 7,258.97 145.18 $9,845.11
hive 73,322.16 293.29 372,560.04 1,490.24 977,333.05 19,546.66 90,800.40 1,816.01 $23,146.20
infosecurity 23,476.46 93.91 77,515.34 310.06 293,616.02 5,872.32 7,458.77 149.18 $6,425.47
mnm 27,113.03 108.45 100,027.28 400.11 391,907.76 7,838.16 10,436.65 208.73 $8,555.45
productintelligence 74,113.17 296.45 158,423.62 633.69 851,435.74 17,028.71 10,456.78 209.14 $18,167.99
techsupport 34,037.16 136.15 100,904.89 403.62 400,972.22 8,019.44 7,120.19 142.40 $8,701.61
Resource Pricing
CPU Cost Per Hour: $ 0.004
Memory Cost Per Gb Per Hour: $ 0.004
Data Io Cost Per Gb: $ 0.020
Network Io Cost Per Gb: $ 0.020
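Applying the pricing above, a showback query over the illustrative job_history_summary table could take this shape (column names are assumptions; network I/O is omitted because the sketched table does not track it):

SELECT queue_name,
       ROUND(SUM(cpu_ms) / 3600000.0, 2)                   AS total_cpu_hours,
       ROUND(SUM(cpu_ms) / 3600000.0 * 0.004, 2)           AS cpu_cost,
       ROUND(SUM(memory_gb_hours), 2)                      AS total_memory_gb_hours,
       ROUND(SUM(memory_gb_hours) * 0.004, 2)              AS memory_cost,
       ROUND(SUM(hdfs_read_gb + hdfs_write_gb), 2)         AS total_data_io_gb,
       ROUND(SUM(hdfs_read_gb + hdfs_write_gb) * 0.020, 2) AS data_io_cost
FROM   job_history_summary
GROUP  BY queue_name;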
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report
• lsr is a recursive file listing of HDFS
• Contains metadata about files
o Permissions
o Owner & Group
o Replication factor
o File size
o Last modified date time
o File path
-------------------------------------------------------------------------------------------------------
|Permissions |rep factor | user | group | size | date | time| file path |
-------------------------------------------------------------------------------------------------------
drwx------ - sheetal etl_users 0 2014-12-13 01:18 /user/sheetal/analytics
-rw-r--r-- 3 sheetal etl_users 15552642 2014-12-13 01:18 /user/sheetal/analytics/server.log
Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing lsr report
Hadoop cluster components: HDFS, YARN, Tez, Hive
Analysis interfaces: JDBC clients, ODBC clients, Hive CLI
Pipeline:
1. Periodically generate the lsr report: hdfs dfs -lsr /
2. Load it into Hive: load data local inpath '/tmp/lsr.txt' overwrite into table lsr
3. Query the data through a preferred interface
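Written out with the file name that step 2 expects, step 1 might look like the line below (a sketch; on current Hadoop releases the deprecated -lsr switch is replaced by hdfs dfs -ls -R /):

hdfs dfs -lsr / > /tmp/lsr.txt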
Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report – Hive Table Definition
CREATE EXTERNAL TABLE lsr (
permissions STRING,
replication STRING,
owner STRING,
group STRING,
size STRING,
date STRING,
time STRING,
file_path STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(S+)s+(S+)s+(S+)s+(S+)s+(S+)s+(S+)s+(S+)s+(.*)"
) ;
Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report – Hive View Definition
CREATE VIEW lsr_view
AS
SELECT ( CASE Substr(permissions, 1, 1)
WHEN 'd' THEN 'DIR'
ELSE 'FILE'
END ) AS file_type,
permissions,
( CASE replication
WHEN '-' THEN 0
ELSE Cast (replication AS INT)
END ) AS replication,
owner,
group,
Cast (size AS INT) AS size,
date,
time,
file_path
FROM lsr ;
Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sample Reports
Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Security Checks – Files Readable by All
SELECT permissions, owner, file_path
FROM lsr_view
WHERE file_type = 'FILE'
AND Substr(permissions, 8, 1) = 'r'
LIMIT 3;
Permissions Owner File Path
-rwxr-xr-x sheetal /user/sheetal/analytics/finance_report/000001_0
-rwxr-xr-x joe_lee /apps/hive/warehouse/sales.db/sales/date=2014-08-17/000001_1
-rw-r--r-- sales_etl /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt
Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data loss risk – Files with low replication factor
SELECT owner, replication, file_path
FROM lsr_view
WHERE file_type = 'FILE'
AND file_path LIKE '/apps/hive/warehouse/%'
AND replication < 3
LIMIT 3;
Owner Replication File Path
elizabeth 1 /apps/hive/warehouse/sales_stg.db/order/order_summary.txt
sales_etl 2 /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt
john_smith 1 /apps/hive/warehouse/archive.db/report_d/000001_0
Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data storage by user
SELECT owner, Sum(size) AS total_size
FROM lsr_view
WHERE file_type = 'FILE'
GROUP BY owner
ORDER BY total_size DESC;
[Pie chart] Storage share by owner: agrissia 30%, albarma 26%, blackupli 15%, blackwardap 8%, brilliantbox 7%, bumpkin 5%, catstoopshard 4%, cozyboyal 2%, fallenvivala 2%, fonetter 1%
Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Small Files
SELECT relative_size, Count(1) AS total
FROM (SELECT ( CASE size < 134217728
WHEN true THEN 'small'
ELSE 'large'
END ) AS relative_size
FROM lsr_view
WHERE file_type = 'FILE') tmp
GROUP BY relative_size;
[Pie chart] small 90%, large 10%
SELECT Avg(size)
FROM lsr_view
WHERE file_type = 'FILE';
Average File Size: 61,305,522 bytes
Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS Audit Logs
Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS audit logs
• Can be enabled by setting audit log level to INFO
• Every HDFS access request is logged
• Contains metadata about access requests
o User name (actual user and proxy user if any)
o IP Address (where request came from)
o Action (Command)
o File Name (Source and destination files involved)
-------------------------------------------------------------------------------------------------------------------------------------
|Date |Time | Status | User | Auth Type | IP Address | Command | Src Path |Dest Path|Perms |
-------------------------------------------------------------------------------------------------------------------------------------
2014-11-19 23:54:57,083 allowed=true ugi=hdfs (auth:SIMPLE) ip=/10.10.150.103 cmd=listStatus src=/mr-history/tmp dst=null perm=null
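For reference, audit logging is typically switched on through the NameNode's log4j configuration. A minimal sketch; the logger name is the standard HDFS audit logger, but the exact file location and appender wiring vary by distribution:

# log4j.properties on the NameNode (illustrative)
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO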
Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing HDFS Audit Log
Hadoop cluster components: HDFS, YARN, Tez, Hive
Analysis interfaces: JDBC clients, ODBC clients, Hive CLI
Pipeline:
1. Audit log is generated during normal operation of HDFS
2. Periodically load it into Hive: load data local inpath '/log/Hadoop/hdfs/hdfs-audit.log.2014-11-19' into table hdfs_audit
3. Query the data through a preferred interface
Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS Audit Log – Hive Table Definition
CREATE EXTERNAL TABLE hdfs_audit (
date STRING,
time STRING,
log_level STRING,
class STRING,
allowed STRING,
user STRING,
auth_str STRING,
auth_type STRING,
proxy_user STRING,
proxy_user_auth_str STRING,
proxy_user_auth_type STRING,
ip STRING,
command STRING,
src_path STRING,
dest_path STRING,
permissions STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" =
"(S+)s+(S+)s+(S+)s+(S+)s+allowed=(S+)s+ugi=(S+)s+.auth:(S+)Ss+(via
(S+))?s*(.auth:(S+)S)?s*ip=.(S+)s+cmd=(S+)s+src=(S+)s+dst=(S+)s+perm=(S+)"
) ;
Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sample Reports
Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Most Frequently Used Datasets
SELECT src_path, Count(1) AS access_frequency
FROM hdfs_audit
GROUP BY src_path
ORDER BY access_frequency DESC
LIMIT 3;
File Path Access Frequency
/domains/drd/production/config/AnalysisModule02Signatures.log 5,758,774
/domains/drd/production/config/ANLCustAnalysisModule02Signatures.log 5,754,181
/domains/drd/production/config/DBFBlockCriteria.log 4,816,841
Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Datasets not read even once
SELECT lsr.file_path AS file_path, lsr.date AS creation_date, lsr.size
AS file_size
FROM lsr_view lsr
LEFT JOIN (SELECT Max(date), src_path
FROM hdfs_audit
WHERE command = 'open'
GROUP BY src_path) audit
ON ( lsr.file_path = audit.src_path )
WHERE lsr.file_type = 'FILE' AND audit.src_path IS NULL
ORDER BY creation_date DESC
LIMIT 3;
File Path Creation Date File Size
/app/hive/warehouse/sales_stg.db/account/account_extract.txt 2014-10-16 76,598,987,465
/app/hive/warehouse/sales_stg.db/order/order_history.txt 2014-11-26 901,341,097,342
/app/hive/warehouse/sales_stg.db/catalog/catalog.txt 2014-11-28 213,353,902,128
Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Potentially intrusive users
SELECT user, Count(1) AS failed_attempts
FROM hdfs_audit
WHERE allowed != 'true'
GROUP BY user
ORDER BY failed_attempts DESC
LIMIT 3;
User Failed Attempts
ryan_m 266
drown_d 238
mac_t 66
Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Potentially malicious client hosts
SELECT ip, Count(1) AS failed_attempts
FROM hdfs_audit
WHERE allowed != 'true'
GROUP BY ip
ORDER BY failed_attempts DESC
LIMIT 3;
IP Address Failed Attempts
10.20.147.245 1059
10.20.145.137 1021
10.20.146.203 1018
Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
• Hadoop generates lots of useful metrics
• Many of the datasets can be easily analyzed with a little effort
o Hive and Pig are great analytical tools
o There are inbuilt SerDes/Loaders for many of the formats
• Simple analytics on HDFS lsr, HDFS Audit, Job History can
empower DevOps to manage their clusters better
Page47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Thank You!
Questions ?