Hadoop 2 : New and Noteworthy
Sujee Maniyam, ElephantScale
sujee@ElephantScale.com
http://ElephantScale.com
Hadoop 2 : New and Noteworthy
© 2013 Storage Networking Industry Association. All Rights Reserved.
SNIA Legal Notice
!   The material contained in this tutorial is copyrighted by the SNIA unless
otherwise noted.
!   Member companies and individual members may use this material in
presentations and literature under the following conditions:
!   Any slide or slides used must be reproduced in their entirety without modification
!   The SNIA must be acknowledged as the source of any material used in the body of
any document containing material from these presentations.
!   This presentation is a project of the SNIA Education Committee.
!   Neither the author nor the presenter is an attorney and nothing in this
presentation is intended to be, or should be construed as legal advice or an
opinion of counsel. If you need legal advice or a legal opinion please
contact your attorney.
!   The information presented herein represents the author's personal opinion
and current understanding of the relevant issues involved. The author, the
presenter, and the SNIA do not assume any responsibility or liability for
damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
Abstract
!   Hadoop 2 : New And Noteworthy Features
!   This session will appeal to Data Center Managers, Development
Managers, and those who are looking for an overview of ‘what’s
new’ in the Hadoop 2 platform. The session will highlight some of the
notable features in Hadoop 2.
Quick Poll
!   How many of you are NEW to Hadoop?
!   How many of you are USING Hadoop?
Hadoop Timeline
Hadoop Versions :-)
Hadoop Versions – Simplified
Hadoop 1 : 1.2.1 (Aug 2013)
Hadoop 2 : 2.2.0 (Oct 2013)
Feature Matrix
Component    Feature                       v1   v2
HDFS         NameNode High Availability    -    X
HDFS         NameNode federation           -    X
HDFS         Snapshots                     -    X
HDFS         NFS v3 access to HDFS         -    X
HDFS         Improved IO                   -    X
Processing   MapReduce v1                  X    -
Processing   YARN (MapReduce v2)           -    X
Other        Kerberos security             X    X
NEXT
!   NameNode High Availability
!   Federation
!   Snapshots
!   NFS
!   Improved IO
HDFS Architecture (V1)
[Diagram: one Name Node above four Data Nodes]
Name Node High Availability
!   HDFS has (had) a one-NameNode / many-DataNode design
!   This makes the NameNode a ‘Single Point of Failure’ (SPOF)
NameNode Is Very Important In A Cluster
Is Hadoop NN Failure A Big Deal?
!   Yahoo study
!   18-month study
!   22 failures on 25 clusters
!   0.58 failures per cluster per year
!   Only half of them would have benefited from HA
!   → 0.23 failures / year / cluster
!   http://www.slideshare.net/Hadoop_Summit/hdfs-namenode-high-availability
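The per-cluster failure rate quoted above can be reproduced from the raw counts. A minimal sketch, assuming the 18-month window applies equally to all 25 clusters; the function name is illustrative, not from the study:

```python
def failures_per_cluster_year(failures: int, clusters: int, months: float) -> float:
    """Average number of failures per cluster per year."""
    years = months / 12.0
    return failures / (clusters * years)


# 22 NameNode failures across 25 clusters over 18 months
rate = failures_per_cluster_year(failures=22, clusters=25, months=18)
print(f"{rate:.3f} failures per cluster per year")
```

This evaluates to roughly 0.587, which the slide reports as 0.58.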
Still Needs To Be Fixed
!   Downtime may be acceptable for batch workloads
!   But it is not acceptable for real-time workloads like
HBase that depend on HDFS
!   Downtime (even minutes) is not acceptable
!   Make Hadoop more enterprise-friendly
How Do We Fix A Single NameNode Failure?
!   Have two NameNodes!
!   One ACTIVE and another PASSIVE
!   When the Active NN fails, the Passive one takes over
!   Failover can be automated
HDFS Architecture (v1)
[Diagram: one Name Node above four Data Nodes]
NameNode HA (V2)
[Diagram: Name Node 1 (active) and Name Node 2 (passive) connected to shared storage, above four Data Nodes]
NameNode HA : Shared Storage
(c) ElephantScale.com, 2014
[Diagram: Name Node 1 (active) and Name Node 2 (passive) above four Data Nodes, sharing a filer]
Option 1) external filer
Option 2) Quorum Journal
NameNode HA
!   NameNode metadata is written to shared storage
(external filer or Quorum Journal Manager)
!   Only the ONE active NN can write to shared storage
!   Passive NN reads and replays metadata from shared
storage
!   When the Active NN fails, the passive NN is promoted to active
!   Promotion can be manual or automatic
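The active/passive flow above can be sketched as a toy model. This is not Hadoop code: the class and method names (`SharedEditLog`, `promote`, `catch_up`) are hypothetical, the namespace is reduced to a set of paths, and a single `writer` field stands in for whatever fencing the real shared storage provides.

```python
class SharedEditLog:
    """Stands in for the shared storage (filer or Quorum Journal Manager)."""

    def __init__(self):
        self.edits = []
        self.writer = None  # name of the only NameNode allowed to write

    def append(self, writer, edit):
        if writer != self.writer:
            raise PermissionError(f"{writer} is not the active writer")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()  # in-memory metadata (here: just file paths)
        self.applied = 0        # number of edits already replayed

    def create_file(self, path):
        """Only succeeds on the active NameNode."""
        self.log.append(self.name, ("create", path))
        self.catch_up()

    def catch_up(self):
        """Replay edits from shared storage that have not been applied yet."""
        for op, path in self.log.edits[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def promote(self):
        """Failover: replay outstanding edits, then take over as writer."""
        self.catch_up()
        self.log.writer = self.name


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
nn1.promote()                 # nn1 starts as active
nn1.create_file("/data/a")
nn1.create_file("/data/b")

nn2.promote()                 # nn1 fails; nn2 is promoted
nn2.create_file("/data/c")
print(sorted(nn2.namespace))
```

After promotion, nn2 sees the files created by nn1 because it replayed them from the shared log, and any later write attempt by nn1 is rejected.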
NameNode HA Setup
NameNode Federation
!   NameNode stores metadata in memory
!   For large (very large) clusters, the NN could exhaust
memory
!   Spread metadata over multiple NameNodes
HDFS Federation
HDFS Federation
!   Now the namespace is divided
!   /hbase → NN1
!   /user → NN2
!   /hive → NN3
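One way to picture the divided namespace is a lookup table from path prefix to NameNode. The sketch below is an illustration only: `MOUNT_TABLE` and `resolve_namenode` are hypothetical names, and the longest-prefix rule is an assumption rather than a description of how Hadoop resolves paths.

```python
# Hypothetical mount table, mirroring the example paths above.
MOUNT_TABLE = {
    "/hbase": "NN1",
    "/user": "NN2",
    "/hive": "NN3",
}


def resolve_namenode(path: str) -> str:
    """Return the NameNode that owns `path`, using longest-prefix match."""
    best = None
    for prefix in MOUNT_TABLE:
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no NameNode is responsible for {path}")
    return MOUNT_TABLE[best]


print(resolve_namenode("/user/alice/report.csv"))
print(resolve_namenode("/hive/warehouse"))
```

Matching on `prefix + "/"` keeps a path such as /username from being routed to the /user NameNode by accident.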
HDFS Federation
!   Namespace is partitioned into ‘block pools’
!   Datanodes are shared across cluster
!   They store blocks for different pools
!   Datanodes send heart-beats to all NNs
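The shared-DataNode idea can be sketched as follows. This is a simplified model, not HDFS code: the class names and pool ids are made up, and block reporting is folded into the heartbeat for brevity.

```python
from collections import defaultdict


class DataNode:
    """Stores blocks for every block pool and reports to every NameNode."""

    def __init__(self, name):
        self.name = name
        self.blocks = defaultdict(set)  # block pool id -> block ids

    def store(self, pool_id, block_id):
        self.blocks[pool_id].add(block_id)

    def heartbeat(self, namenodes):
        # Each NameNode only hears about blocks in its own pool.
        for nn in namenodes:
            nn.receive_heartbeat(self.name, self.blocks[nn.pool_id])


class NameNode:
    def __init__(self, pool_id):
        self.pool_id = pool_id
        self.reports = {}  # datanode name -> block ids in this pool

    def receive_heartbeat(self, datanode, block_ids):
        self.reports[datanode] = set(block_ids)


nn1, nn2 = NameNode("pool-hbase"), NameNode("pool-user")
dn = DataNode("dn1")
dn.store("pool-hbase", "blk_1")
dn.store("pool-user", "blk_2")
dn.store("pool-user", "blk_3")
dn.heartbeat([nn1, nn2])
print(nn1.reports, nn2.reports)
```

One DataNode holds blocks for both pools, yet each NameNode only tracks the blocks that belong to its own pool.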
HDFS Snapshots
!   Wait, doesn’t HDFS make replicas?
!   Yes
!   But it doesn’t save you from :
hdfs dfs -rm -r /data
!   The ‘Trash’ feature only works for CLI utilities
!   You can delete files using the API... Poof, gone
HDFS Snapshots
!   Recover from user errors, other disasters
!   Periodic snapshots
!   E.g : daily backups… keep them for 15 days
!   Snapshotting is
!   Efficient (no data duplication, copy on write)
!   Fast
!   Can snapshot part of the file system (not the whole thing)
!   http://cdn.oreillystatic.com/en/assets/1/event/100/HDFS%20Snapshots%20and%20Beyond%20Presentation.pdf
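Why a snapshot can be cheap and still protect against an accidental delete is easier to see in a toy model. The sketch below is an assumption-laden illustration, not the HDFS implementation: `SnapshottableDir` is a hypothetical class, and it copies the whole name-to-data mapping at snapshot time, where a real system would more likely record only later changes.

```python
class SnapshottableDir:
    """Toy namespace: maps file names to (immutable) block data."""

    def __init__(self):
        self.files = {}      # live view: name -> data
        self.snapshots = {}  # snapshot name -> frozen view

    def write(self, name, data: bytes):
        self.files[name] = data

    def delete(self, name):
        del self.files[name]

    def create_snapshot(self, snap_name):
        # Copies only the name -> data references, never the data itself.
        self.snapshots[snap_name] = dict(self.files)

    def restore(self, snap_name, name):
        """Bring one file back from a snapshot."""
        self.files[name] = self.snapshots[snap_name][name]


d = SnapshottableDir()
d.write("events.log", b"important data")
d.create_snapshot("daily-1")
d.delete("events.log")              # the accidental delete
d.restore("daily-1", "events.log")
print(d.files["events.log"])
```

The snapshot and the restored file point at the same bytes object, so no data is duplicated at any step.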
NFS Access to HDFS
!   HDFS is a userland file system
!   Not a kernel file system
!   So most Linux programs cannot read/write data to HDFS directly
!   We use the ‘hdfs’ command line utils
NFS Access to HDFS
!   Starting with Hadoop v2, HDFS can be accessed over the NFS (v3) protocol
!   NFS access goes through a gateway machine
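Once HDFS is mounted through the gateway, programs can use ordinary file I/O instead of the hdfs utilities. A minimal sketch, assuming the mount behaves like a regular directory for simple appends; the `/mnt/hdfs` mount point and the `append_line` helper are hypothetical, and a temporary directory stands in for the mount so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def append_line(mount_root: Path, relative_path: str, line: str) -> str:
    """Append a line with ordinary file I/O and return the file's contents."""
    target = mount_root / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
    return target.read_text(encoding="utf-8")


# On a real cluster mount_root would be the NFS mount point (for example the
# hypothetical /mnt/hdfs); a temporary directory stands in for it here.
with tempfile.TemporaryDirectory() as tmp:
    contents = append_line(Path(tmp), "user/demo/notes.txt", "hello hdfs")
    print(contents)
```

Nothing in the function is Hadoop-specific, which is the point of exposing HDFS over NFS.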
HDFS Improved IO
!   Lots of performance fixes from v1 → v2
!   Quick comparison
!   Multi-threaded random read
!   HDFS v1 : 264 MB/sec
!   HDFS v2 : 1395 MB/sec (5x !)
Source :
http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum
V2 Features
! HDFS
!   Processing
!   YARN
MapReduce V1
!   MRV1 proved itself as a reliable batch processing
framework!
!   One Job Tracker (master) and many Task Trackers
(workers)
MapReduce Architecture
[Diagram: one Job Tracker above four Task Trackers]
MRV1 Limitations
!   Only supports one programming paradigm
!   Batch processing
!   Alternative processing models are hard (or impossible) to
implement on top of MRV1
!   Real-time processing
!   In-memory data
MRV1 Limitations
!   Single Job Tracker (JT) → single point of failure
!   JT failure kills all running jobs (and queued jobs)
!   JT started to hit scalability limits on very large clusters
!   4,000 nodes
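Looking Ahead
[Diagram: Hadoop v1 and Hadoop v2 stacks side by side]
Hadoop v1 : HDFS, with MRV1 on top doing both 1) processing and 2) resource management
Hadoop v2 : HDFS, with YARN (resource management) on top, and mapreduce / other frameworks running on YARN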
Yarn
!   MRV1 did
!   Resource Management
!   And Processing
!   Separate the two
!   Yarn for resource management
!   MapReduce / other frameworks for processing
!   Now MapReduce is ‘just another app’
Yarn Architecture
YARN Architecture
!   Resource manager : manages the resources for the entire
cluster
!   Node manager : manages resources on a single node
!   Containers : resource buckets (e.g. 2 CPUs + 8 GB RAM)
!   Application masters : one for each application
!   batch mapreduce, storm, etc.
!   Manages application scheduling and execution
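How these pieces fit together can be sketched with a toy allocator. This is not the YARN API: the class names reuse the slide's terms for readability, and the first-fit placement rule is an assumption made for the example, not YARN's actual scheduling policy.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_cpus: int
    free_ram_gb: int


@dataclass
class ResourceManager:
    nodes: list
    allocations: list = field(default_factory=list)

    def allocate(self, app, cpus, ram_gb):
        """Grant a container on the first node with enough free capacity."""
        for node in self.nodes:
            if node.free_cpus >= cpus and node.free_ram_gb >= ram_gb:
                node.free_cpus -= cpus
                node.free_ram_gb -= ram_gb
                self.allocations.append((app, node.name, cpus, ram_gb))
                return node.name
        return None  # cluster is full; the request would have to wait


rm = ResourceManager(nodes=[NodeManager("node1", 4, 16), NodeManager("node2", 4, 16)])
# An application master asks for three containers of 2 CPUs + 8 GB each.
placements = [rm.allocate("mapreduce-job-1", cpus=2, ram_gb=8) for _ in range(3)]
print(placements)
```

The first two containers fill node1 and the third spills over to node2, which shows the resource manager tracking capacity cluster-wide while each node only accounts for itself.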
Adoption of YARN
!   Standard on Hadoop v2
!   Already running at Yahoo at scale
!   Lots of applications are already moving to the YARN
architecture
Apps on Yarn
[Diagram: HDFS at the bottom, YARN above it, and these application types on top of YARN]
Batch (mapreduce)
Streaming (storm, S4)
In-memory (spark)
Graph (giraph)
Realtime (hbase)
Apps on YARN
!   Storm : real-time event processing
!   Giraph : graph processing (in memory)
!   Spark : in-memory, iterative processing
!   HBase
MapReduce on YARN
!   MapReduce is NOT going anywhere
!   Works very well for batch processing
!   Proven
!   Lots of code out there
!   No more single JobTracker
!   Each MapReduce job runs its own Application Master
!   So the failure of one AppMaster only causes that job to fail
!   Other jobs are insulated
!   Better performance
!   MR jobs scale / utilize the cluster better in Yarn (1.5x – 2x)
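The difference in failure blast radius can be shown with two small functions. This is a deliberately simplified sketch: the function names and job names are hypothetical, and it ignores any retry or recovery behavior a real cluster may have.

```python
def surviving_jobs_mrv1(jobs, job_tracker_failed):
    """MRv1: one JobTracker owns every job, so its failure loses them all."""
    return [] if job_tracker_failed else list(jobs)


def surviving_jobs_yarn(jobs, failed_app_masters):
    """YARN: each job has its own AppMaster; only failed ones are lost."""
    return [job for job in jobs if job not in failed_app_masters]


jobs = ["etl", "report", "indexing"]
print(surviving_jobs_mrv1(jobs, job_tracker_failed=True))
print(surviving_jobs_yarn(jobs, failed_app_masters={"report"}))
```

With a single JobTracker the failure takes every job with it, while with per-job AppMasters only the "report" job is affected.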
MapReduce on YARN
Writing A YARN Application
!   http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
So Which Hadoop Should I Use?
!   If you are starting now…
!   Hadoop 2
!   Already using Hadoop 1
!   Worth the upgrade (new features / performance)
!   How do I migrate?
!   Recommended : Stand up a separate v2 cluster and migrate data
over
!   In-place upgrade? (yikes!)
Hadoop Distributions
Distribution   Hadoop v1            Hadoop v2
Cloudera       CDH 3.x / CDH 4.x    CDH 5.x
Hortonworks    HDP 1.x              HDP 2.x
Pivotal HD
Future…
!   HDFS
!   Mirroring across data centers
!   Work well with SSD (solid state drives / flash drives)
!   YARN
!   Better containers (not just JVMs)
!   Performance
!   Make Resource Manager HA
Thanks & Questions?
Attribution & Feedback
Please send any questions or comments regarding this SNIA
Tutorial to tracktutorials@snia.org
The SNIA Education Committee thanks the following
individuals for their contributions to this Tutorial.
Authorship History
Sujee Maniyam (Sept 2014)
Additional Contributors
Joseph White : Review & Feedback
Backup Slides
55

More Related Content

PDF
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
PPTX
Hive and Apache Tez: Benchmarked at Yahoo! Scale
PPTX
Hadoop from Hive with Stinger to Tez
PDF
Build a Time Series Application with Apache Spark and Apache HBase
PDF
Apache Spark & Hadoop
PDF
Improving HDFS Availability with IPC Quality of Service
PPTX
Spark vstez
PPTX
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hadoop from Hive with Stinger to Tez
Build a Time Series Application with Apache Spark and Apache HBase
Apache Spark & Hadoop
Improving HDFS Availability with IPC Quality of Service
Spark vstez
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches

What's hot (19)

PPTX
Hive at Yahoo: Letters from the trenches
PPTX
Architecting a Fraud Detection Application with Hadoop
PPTX
PDF
TriHUG Feb: Hive on spark
PPTX
Introduction to Data Analyst Training
PDF
Apache Spark Overview
PDF
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
PPTX
Spark SQL versus Apache Drill: Different Tools with Different Rules
PDF
Introduction to Spark on Hadoop
PDF
Hadoop 2 - More than MapReduce
PDF
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
PDF
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
PDF
Keynote: Getting Serious about MySQL and Hadoop at Continuent
PPTX
Apache Tez - A New Chapter in Hadoop Data Processing
PPTX
Hadoop And Their Ecosystem
PPTX
What's new in Hadoop Common and HDFS
PPTX
Hive+Tez: A performance deep dive
PDF
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
PPTX
Intro to Apache Spark by Marco Vasquez
Hive at Yahoo: Letters from the trenches
Architecting a Fraud Detection Application with Hadoop
TriHUG Feb: Hive on spark
Introduction to Data Analyst Training
Apache Spark Overview
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
Spark SQL versus Apache Drill: Different Tools with Different Rules
Introduction to Spark on Hadoop
Hadoop 2 - More than MapReduce
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Apache Tez - A New Chapter in Hadoop Data Processing
Hadoop And Their Ecosystem
What's new in Hadoop Common and HDFS
Hive+Tez: A performance deep dive
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Intro to Apache Spark by Marco Vasquez
Ad

Viewers also liked (10)

PDF
Big Data: Querying complex JSON data with BigInsights and Hadoop
PDF
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
PDF
Big Data: Getting started with Big SQL self-study guide
PDF
Big Data: HBase and Big SQL self-study lab
PDF
Big Data: Big SQL and HBase
PDF
Big Data: Working with Big SQL data from Spark
PPTX
Introduction to Cloudera's Administrator Training for Apache Hadoop
PDF
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
PDF
Big Data: SQL on Hadoop from IBM
PPTX
Big Data Analytics with Hadoop
Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Getting started with Big SQL self-study guide
Big Data: HBase and Big SQL self-study lab
Big Data: Big SQL and HBase
Big Data: Working with Big SQL data from Spark
Introduction to Cloudera's Administrator Training for Apache Hadoop
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Big Data: SQL on Hadoop from IBM
Big Data Analytics with Hadoop
Ad

Similar to Hadoop2 new and noteworthy SNIA conf (20)

PDF
field_guide_to_hadoop_pentaho
PDF
Dallas TDWI Meeting Dec. 2012: Hadoop
PDF
Sam fineberg big_data_hadoop_storage_options_3v9-1
DOCX
Hadoop Tutorial for Beginners
PPT
Big Data and Hadoop Basics
PPTX
PPTX
Introduction of Big data and Hadoop
PDF
Hadoop tutorial-pdf.pdf
PPTX
THE SOLUTION FOR BIG DATA
PPTX
THE SOLUTION FOR BIG DATA
PDF
Big data overview by Edgars
PPTX
OPERATING SYSTEM .pptx
PDF
Hadoop description
PPTX
Hadoop a Highly Available and Secure Enterprise Data Warehousing solution
ODP
Hadoop seminar
ODP
HDFS presented by VIJAY
PDF
Introduction to Hadoop
PPTX
Big data and hadoop product page
PDF
Inside the Hadoop Machine @ VMworld
PDF
App Cap2956v2 121001194956 Phpapp01 (1)
field_guide_to_hadoop_pentaho
Dallas TDWI Meeting Dec. 2012: Hadoop
Sam fineberg big_data_hadoop_storage_options_3v9-1
Hadoop Tutorial for Beginners
Big Data and Hadoop Basics
Introduction of Big data and Hadoop
Hadoop tutorial-pdf.pdf
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
Big data overview by Edgars
OPERATING SYSTEM .pptx
Hadoop description
Hadoop a Highly Available and Secure Enterprise Data Warehousing solution
Hadoop seminar
HDFS presented by VIJAY
Introduction to Hadoop
Big data and hadoop product page
Inside the Hadoop Machine @ VMworld
App Cap2956v2 121001194956 Phpapp01 (1)

More from Sujee Maniyam (8)

PDF
Reference architecture for Internet of Things
PDF
Hadoop to spark-v2
PDF
Building secure NoSQL applications nosqlnow_conf_2014
PDF
Launching your career in Big Data
PDF
Hadoop security landscape
PDF
Spark Intro @ analytics big data summit
PPTX
Cost effective BigData Processing on Amazon EC2
PPTX
Iphone client-server app with Rails backend (v3)
Reference architecture for Internet of Things
Hadoop to spark-v2
Building secure NoSQL applications nosqlnow_conf_2014
Launching your career in Big Data
Hadoop security landscape
Spark Intro @ analytics big data summit
Cost effective BigData Processing on Amazon EC2
Iphone client-server app with Rails backend (v3)

Recently uploaded (20)

PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Electronic commerce courselecture one. Pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Modernizing your data center with Dell and AMD
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Empathic Computing: Creating Shared Understanding
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Spectral efficient network and resource selection model in 5G networks
NewMind AI Weekly Chronicles - August'25 Week I
Digital-Transformation-Roadmap-for-Companies.pptx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Building Integrated photovoltaic BIPV_UPV.pdf
Electronic commerce courselecture one. Pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Network Security Unit 5.pdf for BCA BBA.
Modernizing your data center with Dell and AMD
20250228 LYD VKU AI Blended-Learning.pptx
Review of recent advances in non-invasive hemoglobin estimation
Advanced methodologies resolving dimensionality complications for autism neur...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Empathic Computing: Creating Shared Understanding
Mobile App Security Testing_ A Comprehensive Guide.pdf
Encapsulation_ Review paper, used for researhc scholars
Spectral efficient network and resource selection model in 5G networks

Hadoop2 new and noteworthy SNIA conf

  • 1. PRESENTATION TITLE GOES HEREHadoop 2 : New and Noteworthy Sujee Maniyam, ElephantScale sujee@ElephantScale.com http://ElephantScale.com
  • 2. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. SNIA Legal Notice !   The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. !   Member companies and individual members may use this material in presentations and literature under the following conditions: !   Any slide or slides used must be reproduced in their entirety without modification !   The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations. !   This presentation is a project of the SNIA Education Committee. !   Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney. !   The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK. 2
  • 3. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Abstract !   Hadoop 2 : New And Noteworthy Features !   This session will appeal to Data Center Managers, Development Managers, and those that are looking for an overview of ‘whats new’ in Hadoop 2 platform. The session will highlight some of the notable features in Hadoop 2. 3
  • 4. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Quick Poll !   How many of you are NEW to Hadoop? !   How many of you are USING Hadoop?
  • 5. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop Timeline
  • 6. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop Versions – J
  • 7. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop Versions – Simplified Hadoop 1 Hadooop 2 1.2.1 (aug 2013) 2.2.0 : (oct 2013)
  • 8. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Feature Matrix Component Feature V1 v2 HDFS NameNode High Availability X Namenode federation X Snapshots X NFS v3 access to HDFS X Improved IO X Processing MapReduce v1 X YARN (MapReduce v2) X Other Kerberos security X X
  • 9. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NEXT !   NameNode High Availability !   Federation !   Snapshots !   NFS !   Improved IO 9
  • 10. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Architecture (V1) 10 Name Node Data Node Data NodeData NodeData Node
  • 11. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Name Node High Availability !   HDFS has (had) a ONE NameNode/ many Datanode design !   This leads to ‘Single Point of Failure’ (SPOF) for Name Node
  • 12. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NameNode Is Very Important In A Cluster 12
  • 13. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Is Hadoop NN Failure A Big Deal? !   At Yahoo study !   18 month study !   22 failure on 25 clusters !   0.58 failures per cluster per year !   Only half of them would have benefited from HA !   à 0.23 failure / year / cluster ! http://www.slideshare.net/Hadoop_Summit/hdfs- namenode-high-availability
  • 14. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Still Needs To Be Fixed !   Downtime may be acceptable for batch workloads !   But not acceptable for running real time workloads like HBase that depend on HDFS !   Downtime (even minutes) is not acceptable !   Make Hadoop more Enterprise friendly
  • 15. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. How Do We Fix A Single NameNode Failure? !   Have two Namenodes ! !   One ACTIVE and another PASSIVE !   When Active NN fails, Passive one will take over !   Fail over can be automated
  • 16. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Architecture (v1) 16 Name Node Data Node Data NodeData NodeData Node
  • 17. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NameNode HA (V2) 17 Name Node 1 (active) Data Node Data NodeData NodeData Node Name Node 2 (passive) Shared storage
  • 18. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NameNode HA : Shared Storage (c) ElephantScale.com, 2014 18 Name Node 1 (active) Data Node Data NodeData NodeData Node Name Node 2 (passive) Filer Option 1) external filer Option 2) Quorum Journal
  • 19. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Namenode HA !   Namenode meta data is written to a shared storage (external filer or Quorum Journal Manager) !   Only ONE active NN can write to shared storage !   Passive NN reads and replays meta data from shared storage !   When Active NN fails, passive NN is promoted to active !   Can be manual or automatic
  • 20. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NameNode HA Setup 20
  • 21. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NEXT !   NameNode High Availability !   Federation !   Snapshots !   NFS !   Improved IO 21
  • 22. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. Namenode Federation !   Namenode stores meta data in memory !   For large (very large) clusters, NN could exhaust memory !   Spread meta-data over mulitiple namenodes
  • 23. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Federation
  • 24. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Federation !   Now the namespace is divided !   /hbase à NN1 !   /user à NN2 !   /hive à NN3
  • 25. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Federation !   Namespace is partitioned into ‘block pools’ !   Datanodes are shared across cluster !   They store blocks for different pools !   Datanodes send heart-beats to all NNs
  • 26. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NEXT !   NameNode High Availability !   Federation !   Snapshots !   NFS !   Improved IO 26
  • 27. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Snapshots !   Wait, doesn’t HDFS makes replicas? !   Yes !   But it doesn’t save you from : hdfs dfs –rm –r /data !   ‘Trash’ feature only works for CLI utilities !   You can delete files using API.. Poof gone
  • 28. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Snapshots !   Recover from user errors, other disasters !   Peroidic snapshots !   E.g : daily backups… keep them for 15 days !   Snapshotting is !   Efficient (no data duplication, copy on write) !   Fast !   snapshot part of file system (not the whole thing) ! http://cdn.oreillystatic.com/en/assets/1/event/100/HDFS %20Snapshots%20and%20Beyond%20Presentation.pdf
  • 29. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NEXT !   NameNode High Availability !   Federation !   Snapshots !   NFS !   Improved IO 29
  • 30. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NFS Access to HDFS !   HDFS is a userland file system !   Not a kernel file system !   So most linux programs can not read/write data to HDFS !   We use ‘hdfs’ command line utils
  • 31. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NFS Access to HDFS !   HDFS supports NFS protocol starting with v2 !   NFS is done via gateway machine
  • 32. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. NEXT !   NameNode High Availability !   Federation !   Snapshots !   NFS !   Improved IO 32
  • 33. Hadoop 2 : New and Noteworthy © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS Improved IO !   Lots of performance fixes from v1 à v2 !   Quick comparison !   Multi threaded random-read !   HDFS v1 : 264 MB/sec !   HDFS v2 : 1395 MB /sec ( 5x !) Source : http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache- hadoop-forum
  • 34. V2 Features
      • HDFS
      • Processing
          • YARN
  • 35. MapReduce V1
      • MRV1 proved itself as a reliable batch processing framework!
      • One Job Tracker (master) and many Task Trackers (workers)
  • 36. MapReduce Architecture
      • [Diagram: one Job Tracker and four Task Trackers]
  • 37. MRV1 Limitations
      • Only supports one programming paradigm
          • Batch processing
      • Alternative processing models are hard (or impossible) to implement on top of MRV1
          • Real-time processing
          • In-memory data
  • 38. MRV1 Limitations
      • Single Job Tracker (JT) → single point of failure
          • JT failure kills all running jobs (and queued jobs)
      • JT started to hit scalability limits on very large clusters
          • 4,000 nodes
  • 39. Looking Ahead
      • Hadoop v1: HDFS + MRV1, where MRV1 handles both 1) processing and 2) resource management
      • Hadoop v2: HDFS + YARN (resource management), with MapReduce and other frameworks running on top
  • 41. YARN
      • MRV1 did
          • Resource management
          • And processing
      • Separate the two
          • YARN for resource management
          • MapReduce / other frameworks for processing
      • Now MapReduce is ‘just another app’
  • 42. YARN Architecture
  • 43. YARN Architecture
      • Resource Manager: manages resources for the entire cluster
      • Node Manager: manages resources on a single node
      • Containers: resource buckets (e.g., 2 CPUs + 8 GB RAM)
      • Application Masters: one for each application
          • Batch MapReduce, Storm, etc.
          • Manages application scheduling and execution
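The division of labor between the Resource Manager, Node Managers, and containers can be sketched with a toy scheduler. Everything here is an assumption made for illustration: the class names only echo the YARN roles, the first-fit placement is a stand-in rather than any actual YARN scheduling policy, and the node sizes are invented. It is not the YARN API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks free resources on a single node (toy model)."""
    name: str
    free_cpus: int
    free_ram_gb: int
    containers: list = field(default_factory=list)

    def can_fit(self, cpus, ram_gb):
        return self.free_cpus >= cpus and self.free_ram_gb >= ram_gb

    def launch(self, app, cpus, ram_gb):
        # A container is just a reserved bucket of CPU and RAM.
        self.free_cpus -= cpus
        self.free_ram_gb -= ram_gb
        self.containers.append((app, cpus, ram_gb))


class ResourceManager:
    """Knows every node and decides where a container goes (toy model)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, cpus, ram_gb):
        # First-fit placement: pick the first node with enough free resources.
        for node in self.nodes:
            if node.can_fit(cpus, ram_gb):
                node.launch(app, cpus, ram_gb)
                return node.name
        return None  # cluster is full for a request of this size


rm = ResourceManager([NodeManager("node1", 4, 16), NodeManager("node2", 4, 16)])

# Ask for five containers of 2 CPUs + 8 GB RAM, the size used on the slide.
placements = [rm.allocate("mapreduce-job", 2, 8) for _ in range(5)]
print(placements)
```

Each hypothetical node fits two such containers, so the fifth request cannot be placed. In real YARN the request would come from an Application Master and would typically wait for resources rather than simply fail.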
  • 44. Adoption of YARN
      • Standard on Hadoop v2
      • Already running at Yahoo at scale
      • Lots of applications are already moving to the YARN architecture
  • 45. Apps on YARN
      • [Diagram: applications running on YARN, which runs on HDFS]
      • Batch (MapReduce)
      • Streaming (Storm, S4)
      • In-memory (Spark)
      • Graph (Giraph)
      • Realtime (HBase)
  • 46. Apps on YARN
      • Storm: real-time event processing
      • Giraph: graph processing (in memory)
      • Spark: in-memory, iterative processing
      • HBase
  • 47. MapReduce on YARN
      • MapReduce is NOT going anywhere
          • Works very well for batch processing
          • Proven
          • Lots of code out there
      • No more single JobTracker
          • Each MapReduce job runs as its own application, with its own AppMaster
          • So the failure of one AppMaster only causes that job to fail
          • Other jobs are insulated
      • Better performance
          • MR jobs scale / utilize the cluster better on YARN (1.5x to 2x)
      • (c) ElephantScale.com, 2014
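The failure-isolation point can be made concrete with a small comparison. This is a deliberately simplified sketch: the function names and job names are hypothetical, and it models only the claim on the slide (a JobTracker failure takes every job with it, an AppMaster failure takes only its own job), not retries or recovery.

```python
def surviving_jobs_mrv1(jobs, job_tracker_failed):
    """MRV1: every job depends on the single JobTracker."""
    return [] if job_tracker_failed else list(jobs)


def surviving_jobs_yarn(jobs, failed_app_masters):
    """YARN: each job has its own AppMaster, so only those jobs are lost."""
    return [job for job in jobs if job not in failed_app_masters]


jobs = ["etl", "report", "index"]

mrv1_survivors = surviving_jobs_mrv1(jobs, job_tracker_failed=True)
yarn_survivors = surviving_jobs_yarn(jobs, failed_app_masters={"report"})

print(mrv1_survivors)
print(yarn_survivors)
```

With one master failure in each model, MRV1 loses all three jobs while the YARN model loses only the job whose AppMaster failed.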
  • 48. MapReduce on YARN (c) ElephantScale.com, 2014
  • 49. Writing A YARN Application
      • http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
  • 50. So Which Hadoop Should I Use?
      • If you are starting now…
          • Hadoop 2
      • Already using Hadoop 1
          • Worth the upgrade (new features / performance)
      • How do I migrate?
          • Recommended: stand up a separate v2 cluster and migrate data over
          • In-place upgrade? (yeek!)
  • 51. Hadoop Distributions
      • Cloudera: CDH 3.x / CDH 4.x (Hadoop v1), CDH 5.x (Hadoop v2)
      • Hortonworks: HDP 1.x (Hadoop v1), HDP 2.x (Hadoop v2)
      • Pivotal HD
  • 52. Future…
      • HDFS
          • Mirroring across data centers
          • Work well with SSDs (solid state drives / flash drives)
      • YARN
          • Better containers (not just JVMs)
          • Performance
          • Make Resource Manager HA
  • 53. Thanks & Questions?
  • 54. Attribution & Feedback
      • Please send any questions or comments regarding this SNIA Tutorial to tracktutorials@snia.org
      • The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial.
      • Authorship History: Sujee Maniyam (Sept 2014)
      • Additional Contributors: Joseph White (Review & Feedback)
  • 55. Backup Slides