Hadoop 3 (2017 hadoop taiwan workshop)

© Cloudera, Inc. All rights reserved.
Hadoop 3 is coming: what’s new and what’s next?

About Wei-Chiu
Apache Hadoop committer
Software Engineer, Cloudera

Agenda
The Problem
What is Hadoop
Major Hadoop 3 Features
What’s Next?

“Data helps solve problems”
- Anne Wojcicki

Big Data - 3Vs
Volume, Velocity, Variety

Apache Hadoop
The de facto Big Data Analytics platform
A distributed framework to support large-scale computation on commodity hardware
• Petabyte+ storage, 1,000+ compute nodes
• Inspired by Google
• Originally developed by Yahoo!, donated to Apache Software Foundation.
• Open source :)
• 183 committers, thousands of contributors

Apache Hadoop

Cloudera
Commercializes Hadoop* technology
Open source, open culture
CDH - Cloudera’s Distribution for Hadoop
• Platform. Open source
Cloudera Manager (CM), Cloudera Navigator, Key Trustee
• Cluster management, monitoring. Proprietary
*Hadoop and its associated projects

Hadoop Ecosystem

Well, data itself is a problem ...
Clusters becoming larger
• Storage: reduce storage cost
• Compute: much larger cluster
More enterprise adoption
• High availability
• High performance
• Seamless experience
New applications
• More cloud usage
• Ease of development

HDFS Erasure Coding

Hadoop Distributed File System
A fault tolerant, highly scalable storage system
POSIX-like semantics
Security: user authentication, authorization, at-rest encryption, transport encryption
[Diagram: write and read paths with 3x replication; components shown: NameNode, DataNode]

Hadoop Distributed File System
Advantage
• Failure tolerant
But
• 3x storage cost
• 3x datacenter space
• 3x power consumption
How to reduce storage overhead?

Erasure Coding 101
• Parity bit
• XOR
• If X is lost, X can be reconstructed using Y and X ^ Y
• 50% overhead ((3-2)/2)
• Can tolerate one failure
• Reed–Solomon
• RS(k,m): k data cells plus m parity cells, tolerates the loss of any m of the k+m cells
• XOR = RS(2,1)
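The XOR bullet above can be illustrated with a few lines of Python. This is a minimal sketch only: the helper name `xor_bytes` and the sample cell contents are made up for illustration and are not part of Hadoop.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Return the bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("cells must have the same length")
    return bytes(p ^ q for p, q in zip(a, b))


# Two hypothetical data cells, X and Y, of equal length.
x = b"hadoop"
y = b"taiwan"

# The parity cell X ^ Y is stored alongside X and Y:
# three cells stored for two cells of data, i.e. (3-2)/2 = 50% overhead.
parity = xor_bytes(x, y)

# Suppose the cell holding X is lost: rebuild it from Y and the parity.
recovered_x = xor_bytes(y, parity)
assert recovered_x == x
```

Losing two of the three cells cannot be repaired this way, which is why XOR tolerates only one failure.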

Erasure Coding 101
Reed-Solomon
• Compute parity bits for redundancy
• Blocks can be reconstructed after failures
• Configurable durability vs. storage overhead
RS(6,3)
• = 50% storage overhead
• (9-6)/6
[Diagram: K data blocks are encoded into M parity blocks]
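The overhead arithmetic on these slides can be checked with a short calculation. The function names below are illustrative assumptions, not Hadoop APIs; overhead is defined here as extra storage divided by the size of the original data.

```python
def replication_overhead(replicas: int) -> float:
    """Extra storage relative to the data size under n-way replication."""
    return float(replicas - 1)


def rs_overhead(k: int, m: int) -> float:
    """Extra storage relative to the data size under RS(k, m): m parity per k data."""
    return m / k


# 3x replication stores two extra copies and survives the loss of two of them.
print(f"3x replication: {replication_overhead(3):.0%} overhead")

# RS(6,3): (9-6)/6 extra storage, survives the loss of any three cells.
print(f"RS(6,3): {rs_overhead(6, 3):.0%} overhead")

# XOR, viewed as RS(2,1): (3-2)/2 extra storage, survives one loss.
print(f"RS(2,1): {rs_overhead(2, 1):.0%} overhead")
```

Under these definitions RS(6,3) tolerates more failures than 3x replication while using a quarter of the extra storage.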

HDFS-EC: RS(6,3)
[Diagram: a block is written and read as six data strips (strip 1 to strip 6) plus three parity strips (parity 1 to parity 3)]
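Assuming the diagram depicts a striped layout, the placement of data across the six data strips can be sketched as a round-robin distribution of fixed-size cells. The function names and the tiny 4-byte cell size are hypothetical and chosen only to keep the example readable; parity computation is omitted because Reed-Solomon encoding needs finite-field arithmetic that is out of scope here.

```python
from typing import List


def stripe_cells(data: bytes, k: int = 6, cell_size: int = 4) -> List[List[bytes]]:
    """Distribute fixed-size cells round-robin over k data strips."""
    strips: List[List[bytes]] = [[] for _ in range(k)]
    for offset in range(0, len(data), cell_size):
        cell_index = offset // cell_size
        strips[cell_index % k].append(data[offset:offset + cell_size])
    return strips


def read_back(strips: List[List[bytes]]) -> bytes:
    """Reassemble the original bytes by reading one cell per strip, row by row."""
    rows = max(len(strip) for strip in strips)
    cells = []
    for row in range(rows):
        for strip in strips:
            if row < len(strip):
                cells.append(strip[row])
    return b"".join(cells)


data = bytes(range(48))          # 12 cells of 4 bytes each
strips = stripe_cells(data)      # 6 strips holding 2 cells each
assert read_back(strips) == data
```

In this sketch each strip would live on a different DataNode, with the three parity strips placed on three further nodes.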

HDFS-EC: Failure Handling

YARN Federation

YARN
A resource management framework for Hadoop clusters
● Highly scalable, 4000 - 8000 nodes in production
● Hive, Oozie, Spark, …
● HBase

YARN
[Diagram: Client, Resource Manager, Node Manager, Application Master]

YARN Federation
Developed by Microsoft
Extreme scale
● 100,000 compute nodes
● Resource Manager becomes the bottleneck

YARN Federation
[Diagram: two Application Masters, RM Proxy, RM 1, RM 2]

YARN Timeline Service v2

Job History Server
● Keeps track of job progress
● Collects or retrieves information about MapReduce jobs
● Extensibility
○ MR only
● Usability
○ No YARN level events
○ Metrics can only be retrieved after the job terminates

Application Timeline Server v2
● Development led by Twitter
● Usability
○ Flow: logical group of applications
● Scalability
○ HBase
● Use cases
○ Analyze application performance.
○ Cluster capacity planning.
ATSv2 Architecture
Use HBase for storage
Use cases:
● Analyze application performance.
● Cluster capacity planning.

HDFS Multi Standby NameNodes

NameNode High Availability
[Diagram: Client, Active NameNode, Standby NameNode, and a quorum of three Journal Nodes; the Standby NameNode uploads the fsimage]
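The "Quorum" label refers to majority acknowledgement among the Journal Nodes. The following is a minimal sketch of that rule only, with hypothetical function names; it is not the Quorum Journal Manager implementation.

```python
def majority(journal_nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return journal_nodes // 2 + 1


def edit_is_durable(acks: int, journal_nodes: int = 3) -> bool:
    """An edit counts as written once a majority of Journal Nodes acknowledge it."""
    return acks >= majority(journal_nodes)


# With three Journal Nodes, two acknowledgements are enough,
# so one Journal Node can be down without blocking writes.
assert majority(3) == 2
assert edit_is_durable(acks=2)
assert not edit_is_durable(acks=1)
```

With an odd number of Journal Nodes N, this rule tolerates the loss of (N - 1) / 2 of them.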

Multiple Standby NameNode
Contributed by Salesforce.
[Diagram: Client, Active NameNode, two Standby NameNodes, and a quorum of three Journal Nodes; each Standby NameNode uploads the fsimage]

Classpath Isolation

Dependency Hell
Hadoop was not initially designed as the foundation for many applications.
● More applications depend on Hadoop
● Harder for Hadoop to upgrade dependency libraries
● Potential risk of breaking existing applications
● Increased exposure to security vulnerabilities
Classpath Isolation
● Separate client-side classpath from server-side

Cloud

Other features
● Cloud connectors
○ Microsoft Azure Data Lake filesystem
○ Aliyun Object Storage Service

Misc.

Other features and incompatibilities
● Shell script rewrite
● Requires Java 8
● Server ports: defaults moved out of the Linux ephemeral port range
● Remove legacy features
○ S3 file system → S3A (recommended) or S3N
○ Hftp → webhdfs/httpfs
○ Bookkeeper Journal Manager → Quorum Journal Manager

What’s next?

Now what?
Developers
• Use it early, test it early and file bug reports.
Administrators
• Test upgradability
Users
• Expect better user experience.
Timeline
[Timeline figure: Hadoop 3 and CDH6 milestones Alpha 1, Alpha 2, Alpha 3, Beta 1, GA; dates shown: 2016/09, 2016/12, 2017/01, 2017/0?, 2017/0?, ?]

Future? Hadoop 4?
● We don’t know yet.
● Ozone (HDFS-7240)
○ Object store for HDFS
● HDFS over cloud (HDFS-9806)
● Emerging applications and use cases
○ Docker
○ Deep learning
● Hardware trends
○ Cloud storage
○ Faster Ethernet (40 Gbps), high density (> 100 TB) storage nodes
○ Memory technology
○ Locality will not be a deciding factor.

Ozone (HDFS-7240)
Status quo
● NameNode is becoming a bottleneck
● A general file system may not suit the specific needs of an application
Solution
● Split HDFS namespace into blob stores

HDFS over Cloud (HDFS-9806)
Use case
● Use HDFS for temporary data
● Use cloud for permanent storage
The problem
● Data management
● Consistency
Solution
● HDFS as metastore and cache
● Cloud as backend data store

Ask Bigger Questions

References
• Introduction to HDFS Erasure Coding in Apache Hadoop
• Enable YARN RM scale out via federation using multiple RM's
• Application Timeline Server - Past, Present and Future
• HDFS-6440 Support more than 2 NameNodes
• How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop
