Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing Hadoop Using Hadoop
15 Apr 2015
Sheetal Dolas
Principal Architect, Hortonworks
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Who am I ?
• Principal Architect @ Hortonworks
• Most of my career has been in the field, solving real-life business
problems
• Last 5+ years in Big Data, including Hadoop, Storm, etc.
• Co-developed Cisco OpenSOC ( http://opensoc.github.io )
sheetal@hortonworks.com
@sheetal_dolas
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Agenda
• Need for operational insights
• Challenges
• Data sets available
• Using Hadoop to analyze itself
• Sample reports
• Q and A
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Need for Operational Insights
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Need for Metrics Analysis
• Metrics can reveal the story about your cluster
• They help you understand workload characteristics
o Reveal the pain points
o Clear up misconceptions
o Drive toward an action plan
• Operational insights are critical for SLA management by
improving
o System Reliability
o Uptime
o Performance
o Security
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Metrics Challenges
• Hadoop generates a lot of metrics
o Host metrics (CPU, Memory, Disk, Network)
o Service metrics (JVM metrics, GC, Transactions, Performance)
o Service reports (fsck, lsr, dfs admin, audit logs)
o Job Metrics (Resource utilization, data processed, performance)
• Understanding and analyzing them is overwhelming
• No existing tools adequately address the whole spectrum
• Deeper technology understanding is needed
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Metrics can appear like this conversation
hmm…
hmmmm…
hah!
ahem!
ahh!
eh?
Hadoop Expert Hadoop Newbie
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Metrics can appear like this conversation
You know all the words and their meaning
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Metrics can appear like this conversation
But you still don’t get the meaning of the conversation
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
We need tools that help extract meaning out of it
Hadoop Expert Hadoop Newbie
hmm…
Hadoop has Magnificent
Metrics
hmmmm…
Hadoop Metrics Make Me Mad
hah!
Hadoop Analyzes Hadoop
ahem!
Analyze Hadoop Easily in
Minutes
ahh!
Awesome! Hail Hadoop!
eh?
Elucidative Hadoop?
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Datasets available
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Datasets available for analysis
• MapReduce job history log
• HDFS lsr report
• HDFS Audit log
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
MapReduce Job History Log
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Job history log
• Stored on HDFS
• Contains all the events that occurred in a job, plus the event metadata
• Has its own format
o Can be parsed using the Rumen API (see the sketch below)
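One entry point to Rumen is its TraceBuilder tool, which converts job history files into a JSON job trace plus a cluster topology file. A minimal sketch, assuming the default done-directory location (all paths below are illustrative, and depending on the distribution the Rumen classes may need to be added to the classpath):

hadoop org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json \
  file:///tmp/job-topology.json \
  hdfs:///mr-history/done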
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing MapReduce Job history log
Hadoop cluster components: HDFS, YARN, Tez, Hive
Analysis interfaces: JDBC clients, ODBC clients, Hive CLI
Pipeline:
1. Periodically read the job history logs from HDFS
2. Parse the logs with Rumen, compute job resource metrics, and write the results back to Hive (a sketch of a possible landing table follows)
3. Query the data through a preferred interface
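As a concrete landing target for step 2, the parsed per-job metrics could be written to a Hive table along the following lines. This is only a sketch; the deck does not show its actual schema, so the table name and columns here are assumptions:

CREATE TABLE IF NOT EXISTS job_history_summary (
  job_id          STRING,
  job_name        STRING,
  user_name       STRING,
  queue_name      STRING,
  submit_date     STRING,   -- 'yyyy-MM-dd', derived from the submit timestamp
  submit_time     BIGINT,   -- epoch milliseconds
  launch_time     BIGINT,
  finish_time     BIGINT,
  status          STRING,   -- SUCCEEDED / FAILED / KILLED
  total_maps      INT,
  total_reduces   INT,
  cpu_ms          BIGINT,   -- aggregate task CPU milliseconds
  memory_gb_hours DOUBLE,   -- container GB held, multiplied by hours held
  hdfs_read_gb    DOUBLE,
  hdfs_write_gb   DOUBLE
)
STORED AS ORC;

The sample reports that follow can then be expressed as plain HiveQL over such a table.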
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sample Reports
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
CPU Utilization
[Pie chart] CPU Utilization - By Queue - Week To Date: productintelligence 33%, cfld 28%, adhoc 25%, hive 8%, techsupport 3%, mnm 3%, webhcat 0%, infosecurity 0%, prodintel_small 0%
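A report like this could come from a query of the following shape against the illustrative job_history_summary table sketched earlier (the table and its columns are assumptions, not shown in the deck); the percentage shares follow from dividing each queue's hours by the overall total:

SELECT queue_name,
       ROUND(SUM(cpu_ms) / 3600000.0, 2) AS cpu_hours   -- milliseconds to hours
FROM   job_history_summary
WHERE  submit_date >= '2015-04-13'                      -- start of the current week, illustrative
GROUP  BY queue_name
ORDER  BY cpu_hours DESC;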
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Disk IO
[Pie chart] Data IO (GB) - By User - Yesterday. Labeled shares: 53%, 33%, 7%, 7%, 0%. User IDs: katharine.matsumoto, hadoop_sa, ebrown, mzang, justin.meyer, jmarquez, nbhupalam, rchakravarthy, pyan, rchirala, thomas.cox
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Workload Distribution Through Hour of Day
[Three charts, x-axis: job submission hour (0-22)] Number of jobs submitted (0-350), number of tasks submitted (0-900,000), and total data processed in GB (0-500,000).
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Workload Distribution Through Day of Week
[Three charts, x-axis: job submission day of week] Number of jobs submitted (0-300), number of tasks submitted (0-1,800,000), and total data processed in GB (0-700,000).
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Job Type and Status
Job Distribution By Type - Yesterday: Hive 73, MapReduce 143, Pig 199
Job Distribution By Status - Yesterday: SUCCEEDED 98%, FAILED 2%, KILLED 0%
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Top 5 long running jobs - Yesterday
Job Id Job Name User Name Queue Name Job Duration
job_1409197939494_7043 PigLatin:mbl_chtr_ios_blackberry_metrics.pig joy_d cfld 1 d 16 h 33 m 15 s
job_1409197939494_7629 PigLatin:LTF:09:12:Job3 john_s infosecurity 1 d 8 h 40 m 42 s
job_1409197939494_7243 PigLatin:mbl_chtr_ios_blackberry_metrics.pig joy_d cfld 1 d 6 h 54 m 56 s
job_1409197939494_7042 PigLatin:mbl_chtr_android_metrics.pig hadoop_sa hive 1 d 3 h 37 m 30 s
job_1409197939494_7328 INSERT INTO TABLE com...ILE__NAME,'.')[5])(Stage-1) hadoop_sa hive 1 d 1 h 28 m 35 s
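A sketch of how such a report could be produced from the illustrative job_history_summary table introduced earlier (column names are assumptions):

SELECT job_id, job_name, user_name, queue_name,
       (finish_time - launch_time) / 1000 AS duration_seconds
FROM   job_history_summary
WHERE  submit_date = '2015-04-14'          -- "yesterday", illustrative
ORDER  BY duration_seconds DESC
LIMIT  5;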
Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Top 5 long waiting jobs - Yesterday
Job Id Job Name User Name Queue Name Job Submission Wait
job_1409197939494_7621 ODS_S.ODS_LOG_ORG_TYP_METRICS.jar joy_d cfld 5 h 39 m 38 s
job_1409197939494_8222 PigLatin:LTF:09:15:Job3 john_s infosecurity 5 h 19 m 46 s
job_1409197939494_8357 PigLatin:LTF:09:19:Job9 raj_s mnm 5 h 18 m 47 s
job_1409197939494_7622 PigLatin:Log_U_Org_Metrics.pig katherine_d cfld 5 h 11 m 12 s
job_1409197939494_8071 PigLatin:LTF:09:16:Job10 raj_s mnm 5 h 4 m
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Top 5 resource consuming jobs
Job Id Total Maps Total Reduces Requested Map GB Requested Reduce GB Total Memory Blocked By Job (GB)
job_1403277400645_1400 27,358 6 4 4 109,456
job_1403277400645_1423 27,358 3 4 4 109,444
job_1403277400645_1745 5,581 1 4 4 22,328
job_1403277400645_1497 1,807 0 4 4 7,228
job_1403277400645_1564 1,794 0 4 4 7,176
Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Showback Reports
Queue Name | Total CPU Hours Used | CPU Cost | Total Memory GB Hours Blocked | Memory Cost | Total Data IO GB | Data IO Cost | Total Network IO GB | Network IO Cost | Total Cost
adhoc 4,422.94 17.69 20,404.09 81.62 70,918.29 1,418.37 394.01 7.88 $1,525.56
cfld 41,038.93 164.16 150,130.90 600.52 446,762.29 8,935.25 7,258.97 145.18 $9,845.11
hive 73,322.16 293.29 372,560.04 1,490.24 977,333.05 19,546.66 90,800.40 1,816.01 $23,146.20
infosecurity 23,476.46 93.91 77,515.34 310.06 293,616.02 5,872.32 7,458.77 149.18 $6,425.47
mnm 27,113.03 108.45 100,027.28 400.11 391,907.76 7,838.16 10,436.65 208.73 $8,555.45
productintelligence 74,113.17 296.45 158,423.62 633.69 851,435.74 17,028.71 10,456.78 209.14 $18,167.99
techsupport 34,037.16 136.15 100,904.89 403.62 400,972.22 8,019.44 7,120.19 142.40 $8,701.61
Resource Pricing
CPU Cost Per Hour: $ 0.004
Memory Cost Per Gb Per Hour: $ 0.004
Data Io Cost Per Gb: $ 0.020
Network Io Cost Per Gb: $ 0.020
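Applying the pricing above, a showback query over the illustrative job_history_summary table could take this shape (column names are assumptions; network I/O is omitted because the sketched table does not track it):

SELECT queue_name,
       ROUND(SUM(cpu_ms) / 3600000.0, 2)                   AS total_cpu_hours,
       ROUND(SUM(cpu_ms) / 3600000.0 * 0.004, 2)           AS cpu_cost,
       ROUND(SUM(memory_gb_hours), 2)                      AS total_memory_gb_hours,
       ROUND(SUM(memory_gb_hours) * 0.004, 2)              AS memory_cost,
       ROUND(SUM(hdfs_read_gb + hdfs_write_gb), 2)         AS total_data_io_gb,
       ROUND(SUM(hdfs_read_gb + hdfs_write_gb) * 0.020, 2) AS data_io_cost
FROM   job_history_summary
GROUP  BY queue_name;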
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report
• lsr is a recursive file listing of HDFS
• Contains metadata about files
o Permissions
o Owner & Group
o Replication factor
o File size
o Last modified date time
o File path
-------------------------------------------------------------------------------------------------------
|Permissions |rep factor | user | group | size | date | time| file path |
-------------------------------------------------------------------------------------------------------
drwx------ - sheetal etl_users 0 2014-12-13 01:18 /user/sheetal/analytics
-rw-r--r-- 3 sheetal etl_users 15552642 2014-12-13 01:18 /user/sheetal/analytics/server.log
Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing lsr report
Hadoop cluster components: HDFS, YARN, Tez, Hive
Analysis interfaces: JDBC clients, ODBC clients, Hive CLI
Pipeline:
1. Periodically generate the lsr report: hdfs dfs -lsr /
2. Load it into Hive: load data local inpath '/tmp/lsr.txt' overwrite into table lsr
3. Query the data through a preferred interface
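Written out with the file name that step 2 expects, step 1 might look like the line below (a sketch; on current Hadoop releases the deprecated -lsr switch is replaced by hdfs dfs -ls -R /):

hdfs dfs -lsr / > /tmp/lsr.txt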
Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report – Hive Table Definition
CREATE EXTERNAL TABLE lsr (
permissions STRING,
replication STRING,
owner STRING,
group STRING,
size STRING,
date STRING,
time STRING,
file_path STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(S+)s+(S+)s+(S+)s+(S+)s+(S+)s+(S+)s+(S+)s+(.*)"
) ;
Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS lsr report – Hive View Definition
CREATE VIEW lsr_view
AS
SELECT ( CASE Substr(permissions, 1, 1)
WHEN 'd' THEN 'DIR'
ELSE 'FILE'
END ) AS file_type,
permissions,
( CASE replication
WHEN '-' THEN 0
ELSE Cast (replication AS INT)
END ) AS replication,
owner,
group,
Cast (size AS INT) AS size,
date,
time,
file_path
FROM lsr ;
Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sample Reports
Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Security Checks – Files Readable by All
SELECT permissions, owner, file_path
FROM lsr_view
WHERE file_type = 'FILE'
AND Substr(permissions, 8, 1) = 'r'
LIMIT 3;
Permissions Owner File Path
-rwxr-xr-x sheetal /user/sheetal/analytics/finance_report/000001_0
-rwxr-xr-x joe_lee /apps/hive/warehouse/sales.db/sales/date=2014-08-17/000001_1
-rw-r--r-- sales_etl /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt
Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data loss risk – Files with low replication factor
SELECT owner, replication, file_path
FROM lsr_view
WHERE file_type = 'FILE'
AND file_path LIKE '/apps/hive/warehouse/%'
AND replication < 3
LIMIT 3;
Owner Replication File Path
elizabeth 1 /apps/hive/warehouse/sales_stg.db/order/order_summary.txt
sales_etl 2 /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt
john_smith 1 /apps/hive/warehouse/archive.db/report_d/000001_0
Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data storage by user
SELECT owner, Sum(size) AS total_size
FROM lsr_view
WHERE file_type = 'FILE'
GROUP BY owner
ORDER BY total_size DESC;
[Pie chart] Storage share by owner: agrissia 30%, albarma 26%, blackupli 15%, blackwardap 8%, brilliantbox 7%, bumpkin 5%, catstoopshard 4%, cozyboyal 2%, fallenvivala 2%, fonetter 1%
Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Small Files
SELECT relative_size, Count(1) AS total
FROM (SELECT ( CASE size < 134217728
WHEN true THEN 'small'
ELSE 'large'
END ) AS relative_size
FROM lsr_view
WHERE file_type = 'FILE') tmp
GROUP BY relative_size;
[Pie chart] small 90%, large 10%
SELECT Avg(size)
FROM lsr_view
WHERE file_type = 'FILE';
Average File Size: 61,305,522 bytes
Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS Audit Logs
Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS audit logs
• Can be enabled by setting audit log level to INFO
• Every HDFS access request is logged
• Contains metadata about access requests
o User name (actual user and proxy user if any)
o IP Address (where request came from)
o Action (Command)
o File Name (Source and destination files involved)
-------------------------------------------------------------------------------------------------------------------------------------
|Date |Time | Status | User | Auth Type | IP Address | Command | Src Path |Dest Path|Perms |
-------------------------------------------------------------------------------------------------------------------------------------
2014-11-19 23:54:57,083 allowed=true ugi=hdfs (auth:SIMPLE) ip=/10.10.150.103 cmd=listStatus src=/mr-history/tmp dst=null perm=null
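For reference, audit logging is typically switched on through the NameNode's log4j configuration. A minimal sketch; the logger name is the standard HDFS audit logger, but the exact file location and appender wiring vary by distribution:

# log4j.properties on the NameNode (illustrative)
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO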
Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing HDFS Audit Log
Hadoop cluster components: HDFS, YARN, Tez, Hive
Analysis interfaces: JDBC clients, ODBC clients, Hive CLI
Pipeline:
1. Audit log is generated during normal operation of HDFS
2. Periodically load it into Hive: load data local inpath '/log/Hadoop/hdfs/hdfs-audit.log.2014-11-19' into table hdfs_audit
3. Query the data through a preferred interface
Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS Audit Log – Hive Table Definition
CREATE EXTERNAL TABLE hdfs_audit (
date STRING,
time STRING,
log_level STRING,
class STRING,
allowed STRING,
user STRING,
auth_str STRING,
auth_type STRING,
proxy_user STRING,
proxy_user_auth_str STRING,
proxy_user_auth_type STRING,
ip STRING,
command STRING,
src_path STRING,
dest_path STRING,
permissions STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" =
"(S+)s+(S+)s+(S+)s+(S+)s+allowed=(S+)s+ugi=(S+)s+.auth:(S+)Ss+(via
(S+))?s*(.auth:(S+)S)?s*ip=.(S+)s+cmd=(S+)s+src=(S+)s+dst=(S+)s+perm=(S+)"
) ;
Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sample Reports
Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Most Frequently Used Datasets
SELECT src_path, Count(1) AS access_frequency
FROM hdfs_audit
GROUP BY src_path
ORDER BY access_frequency DESC
LIMIT 3;
File Path Access Frequency
/domains/drd/production/config/AnalysisModule02Signatures.log 5,758,774
/domains/drd/production/config/ANLCustAnalysisModule02Signatures.log 5,754,181
/domains/drd/production/config/DBFBlockCriteria.log 4,816,841
Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Datasets not read even once
SELECT lsr.file_path AS file_path, lsr.date AS creation_date, lsr.size
AS file_size
FROM lsr_view lsr
LEFT JOIN (SELECT Max(date), src_path
FROM hdfs_audit
WHERE command = 'open'
GROUP BY src_path) audit
ON ( lsr.file_path = audit.src_path )
WHERE lsr.file_type = 'FILE' AND audit.src_path IS NULL
ORDER BY creation_date DESC
LIMIT 3;
File Path Creation Date File Size
/app/hive/warehouse/sales_stg.db/account/account_extract.txt 2014-10-16 76,598,987,465
/app/hive/warehouse/sales_stg.db/order/order_history.txt 2014-11-26 901,341,097,342
/app/hive/warehouse/sales_stg.db/catalog/catalog.txt 2014-11-28 213,353,902,128
Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Potentially intrusive users
SELECT user, Count(1) AS failed_attempts
FROM hdfs_audit
WHERE allowed != 'true'
GROUP BY user
ORDER BY failed_attempts DESC
LIMIT 3;
User Failed Attempts
ryan_m 266
drown_d 238
mac_t 66
Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Potentially malicious client hosts
SELECT ip, Count(1) AS failed_attempts
FROM hdfs_audit
WHERE allowed != 'true'
GROUP BY ip
ORDER BY failed_attempts DESC
LIMIT 3;
IP Address Failed Attempts
10.20.147.245 1059
10.20.145.137 1021
10.20.146.203 1018
Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
• Hadoop generates lots of useful metrics
• Many of the datasets can be easily analyzed with a little effort
o Hive and Pig are great analytical tools
o There are inbuilt SerDes/Loaders for many of the formats
• Simple analytics on HDFS lsr, HDFS Audit, Job History can
empower DevOps to manage their clusters better
Page47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Thank You!
Questions ?