© Cloudera, Inc. All rights reserved. 1
Hadoop 3 is coming — what’s new and
what’s next?
© Cloudera, Inc. All rights reserved. 2
About Wei-Chiu
Apache Hadoop committer
Software Engineer, Cloudera
© Cloudera, Inc. All rights reserved. 3
Agenda
The Problem
What is Hadoop
Major Hadoop 3 Features
What’s Next?
© Cloudera, Inc. All rights reserved. 4
- Anne Wojcicki
“Data helps solve problems”
© Cloudera, Inc. All rights reserved. 5
Big Data - 3Vs
Volume Velocity Variety
© Cloudera, Inc. All rights reserved. 6
Apache Hadoop
The de facto Big Data Analytics platform
A distributed framework to support large scale computation on commodity
hardware
• Petabyte+ storage, 1,000+ compute nodes
• Inspired by Google
• Originally developed by Yahoo!, donated to the Apache Software Foundation
• Open source :)
• 183 committers, thousands of contributors
© Cloudera, Inc. All rights reserved. 7
Apache Hadoop
© Cloudera, Inc. All rights reserved. 8
Cloudera
Commercializes Hadoop* technology
Open source, open culture
CDH - Cloudera’s Distribution for Hadoop
• Platform. Open source
Cloudera Manager (CM), Cloudera Navigator, Key Trustee
• Cluster management, monitoring. Proprietary
*Hadoop and its associated projects
© Cloudera, Inc. All rights reserved. 9
Hadoop Ecosystem
© Cloudera, Inc. All rights reserved. 10
Storage: reduce storage cost
Compute: much larger cluster
Well, data itself is a problem ...
High availability
High performance
Seamless experience
More cloud usage
Ease of development
Clusters becoming larger More enterprise adoption New applications
© Cloudera, Inc. All rights reserved. 11
HDFS Erasure Coding
© Cloudera, Inc. All rights reserved. 12
Hadoop Distributed File System
A fault-tolerant, highly scalable storage system
POSIX-like semantics
Security: user authentication, authorization, at-rest encryption, transport encryption
[Diagram: NameNode and DataNodes, write and read paths, 3x replication]
© Cloudera, Inc. All rights reserved. 13
Advantage
• Failure tolerant
But
• 3x storage cost
• 3x datacenter space
• 3x power consumption
Hadoop Distributed File System
How to reduce storage overhead?
© Cloudera, Inc. All rights reserved. 14
• Parity bit
• XOR
• If X is lost, X can be reconstructed using Y and X ^ Y
• 50% overhead ((3-2)/2)
• Can tolerate one failure
• Reed–Solomon
• RS(k,m): k data cells plus m parity cells; tolerates the loss of any m cells
• XOR = RS(2,1)
Erasure Coding 101
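
The XOR bullet can be tried out directly. The following is a minimal Python sketch, not HDFS code: the helper name `xor_bytes` and the sample cell contents are made up for illustration. It builds one parity cell from two data cells and then rebuilds a lost cell from the survivor and the parity.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length cells (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("cells must have equal length")
    return bytes(p ^ q for p, q in zip(a, b))


# Two data cells and one parity cell: the RS(2,1) layout named on the slide.
x = b"hadoop"
y = b"erasur"
parity = xor_bytes(x, y)  # X ^ Y

# Suppose the cell holding X is lost. Y ^ (X ^ Y) gives X back.
recovered_x = xor_bytes(y, parity)
assert recovered_x == x
```

The same call recovers Y if Y is the lost cell, but losing two of the three cells is unrecoverable, which is why XOR tolerates only one failure.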
© Cloudera, Inc. All rights reserved. 15
Reed-Solomon
• Compute parity bits for redundancy
• Blocks can be reconstructed after failures
• Configurable durability vs. storage overhead
RS(6,3)
• = 50% storage overhead
• (9-6)/6
Erasure Coding 101
K blocks
encode
M parity blocks
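
The overhead arithmetic on this slide can be compared with 3x replication in a few lines. This is a sketch for illustration only; the function names and the 100 TB figure are assumptions, not part of the deck.

```python
def ec_overhead(k: int, m: int) -> float:
    """Extra storage for RS(k, m), as a fraction of the data size: m / k."""
    return m / k


def replication_overhead(replicas: int) -> float:
    """Extra storage for n-way replication: every copy beyond the first."""
    return float(replicas - 1)


def raw_storage(data_tb: float, overhead: float) -> float:
    """Raw capacity needed to hold data_tb of user data."""
    return data_tb * (1.0 + overhead)


data_tb = 100.0  # hypothetical data set size
for name, overhead, tolerated in [
    ("3x replication", replication_overhead(3), 2),
    ("RS(6,3)", ec_overhead(6, 3), 3),
    ("XOR = RS(2,1)", ec_overhead(2, 1), 1),
]:
    print(f"{name}: {overhead:.0%} overhead, "
          f"{raw_storage(data_tb, overhead):.0f} TB raw, "
          f"tolerates {tolerated} lost copies/cells")
```

Under these definitions RS(6,3) needs half the raw capacity of 3x replication for the same data while tolerating one more failure.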
© Cloudera, Inc. All rights reserved. 16
HDFS-EC: RS(6,3)
[Diagram: write and read paths for a block laid out as six data strips (strip 1 to strip 6) and three parity strips (parity 1 to parity 3)]
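
One way to picture the RS(6,3) layout is as round-robin placement of fixed-size cells across the six data strips. The sketch below is a simplified model, not the HDFS client's code: the 1 MiB cell size, the 0-based strip numbering, and the function name `locate` are assumptions for illustration.

```python
CELL_SIZE = 1024 * 1024  # assumption: 1 MiB cells
DATA_UNITS = 6           # RS(6,3): six data strips
PARITY_UNITS = 3         # ...plus three parity strips


def locate(offset: int, cell_size: int = CELL_SIZE, k: int = DATA_UNITS):
    """Map a logical file offset to (stripe, data strip, offset in strip).

    Cells are dealt out round-robin: cell 0 goes to strip 0, cell 1 to
    strip 1, and so on; cell k starts the next stripe back on strip 0.
    """
    cell_index = offset // cell_size
    stripe = cell_index // k
    strip = cell_index % k  # 0-based, so "strip 1" on the slide is 0 here
    offset_in_strip = stripe * cell_size + offset % cell_size
    return stripe, strip, offset_in_strip


print(locate(0))                  # first byte of the file
print(locate(CELL_SIZE))          # first byte of the second cell
print(locate(6 * CELL_SIZE + 5))  # sixth byte of the second stripe
```

Each stripe of six data cells gets its own three parity cells, so a reader that loses up to three strips can still decode every stripe.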
© Cloudera, Inc. All rights reserved. 17
HDFS-EC: Failure Handling
© Cloudera, Inc. All rights reserved. 18
© Cloudera, Inc. All rights reserved. 19
YARN Federation
© Cloudera, Inc. All rights reserved. 20
YARN
A resource management framework for Hadoop clusters
● Highly scalable, 4000 - 8000 nodes in production
● Hive, Oozie, Spark
● HBase
© Cloudera, Inc. All rights reserved. 21
YARN
[Diagram: Client, Resource Manager, Node Manager, Application Master]
© Cloudera, Inc. All rights reserved. 22
YARN Federation
Developed by Microsoft
Extreme scale
● 100,000 compute nodes
● Resource Manager becomes the bottleneck
© Cloudera, Inc. All rights reserved. 23
YARN Federation
[Diagram: two Application Masters, RM Proxy, RM 1, RM 2]
© Cloudera, Inc. All rights reserved. 24
YARN Timeline Service v2
© Cloudera, Inc. All rights reserved. 25
Job History Server
● Keeps track of job progress
● Collects or retrieves information about MapReduce jobs
● Extensibility
○ MR only
● Usability
○ No YARN-level events
○ Metrics can only be retrieved after a job terminates
© Cloudera, Inc. All rights reserved. 26
Application Timeline Server v2
● Development led by Twitter
● Usability
○ Flow: logical group of applications
● Scalability
○ HBase
● Use cases
○ Analyze application performance.
○ Cluster capacity planning.
© Cloudera, Inc. All rights reserved. 27
Use HBase for storage
Use cases:
● Analyze application performance.
● Cluster capacity planning.
ATSv2 Architecture
© Cloudera, Inc. All rights reserved. 28
HDFS Multi Standby NameNodes
© Cloudera, Inc. All rights reserved. 29
NameNode High Availability
[Diagram: Client, Active NameNode, Standby NameNode, quorum of three Journal Nodes; arrow labeled "Upload fsimage"]
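
The Journal Node quorum works on majority agreement: an edit counts as committed once a majority of Journal Nodes have acknowledged it. The sketch below only does that arithmetic. The function names are illustrative and are not Hadoop APIs.

```python
def majority(n: int) -> int:
    """Smallest number of Journal Nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Journal Nodes that can be down while a majority is still reachable."""
    return n - majority(n)


def edit_is_committed(acks: int, n: int) -> bool:
    """True once a majority of the n Journal Nodes acknowledged the edit."""
    return acks >= majority(n)


for n in (3, 4, 5):
    print(f"{n} Journal Nodes: majority {majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

With the three Journal Nodes in the diagram, one can fail without blocking writes. A fourth node adds no extra tolerance, which is why odd counts are the usual choice.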
© Cloudera, Inc. All rights reserved. 30
Contributed by Salesforce.
Multiple Standby NameNode
[Diagram: Client, Active NameNode, two Standby NameNodes, quorum of three Journal Nodes; two arrows labeled "Upload fsimage"]
© Cloudera, Inc. All rights reserved. 34
Classpath Isolation
© Cloudera, Inc. All rights reserved. 35
Dependency Hell
© Cloudera, Inc. All rights reserved. 36
Dependency Hell
Hadoop was not initially designed as a foundation for many applications.
● More applications depend on Hadoop
● Harder for Hadoop to upgrade dependency libraries
● Potential risk of breaking existing applications
● Increased exposure to security vulnerabilities
Classpath Isolation
● Separate client-side classpath from server-side
© Cloudera, Inc. All rights reserved. 37
Cloud
© Cloudera, Inc. All rights reserved. 38
● Cloud connectors
○ Microsoft Azure Data Lake filesystem
○ Aliyun Object Storage Service
Other features
© Cloudera, Inc. All rights reserved. 39
Misc.
© Cloudera, Inc. All rights reserved. 40
● Shell script rewrite
● Requires Java 8
● Server ports
● Remove legacy features
○ S3 file system → S3A (recommended) or S3N
○ Hftp → webhdfs/httpfs
○ Bookkeeper Journal Manager → Quorum Journal Manager
Other features and incompatibility
© Cloudera, Inc. All rights reserved. 41
What’s next?
© Cloudera, Inc. All rights reserved. 42
Developers
• Use it early, test it early and file bug reports.
Administrators
• Test upgradability
Users
• Expect better user experience.
Now what?
Hadoop 3 Timeline
[Figure: release timeline with milestones Alpha 1, Alpha 2, Alpha 3, Beta 1, GA, and CDH6; dates shown are 2016/09, 2016/12, 2017/01, 2017/0?, 2017/0?, and ?]
© Cloudera, Inc. All rights reserved. 43
● We don’t know yet.
● Ozone (HDFS-7240)
○ Object store for HDFS
● HDFS over cloud (HDFS-9806)
● Emerging applications and use cases
● Docker
● Deep learning
● Hardware Trend
○ Cloud storage
○ Faster Ethernet (40 Gbps), high-density (> 100 TB) storage nodes
○ Memory technology
○ Locality will not be a deciding factor.
Future? Hadoop 4?
© Cloudera, Inc. All rights reserved. 44
Ozone (HDFS-7240)
Status quo
● NameNode is becoming a bottleneck
● A general file system may not suit the specific needs of an application
Solution
● Split HDFS namespace into blob stores
© Cloudera, Inc. All rights reserved. 45
HDFS over Cloud (HDFS-9806)
Use case
● Use HDFS for temporary data
● Use cloud for permanent storage
The problem
● Data management
● Consistency
Solution
● HDFS as metastore and cache
● Cloud as backend data store
© Cloudera, Inc. All rights reserved. 46
Ask Bigger Questions
© Cloudera, Inc. All rights reserved. 47
• Introduction to HDFS Erasure Coding in Apache Hadoop
• Enable YARN RM scale out via federation using multiple RM's
• Application Timeline Server - Past, Present and Future
• HDFS-6440 Support more than 2 NameNodes
• How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop
References

Hadoop 3 (2017 hadoop taiwan workshop)


Editor's Notes

  • #5: We live in the information era. We live in a big data era.
  • #6: More important than the 3 Vs is analyzing the data and getting extra value out of it. Leverage big data to answer new questions.
  • #7: Compared to MPI: data locality, embarrassingly parallel, fault tolerant, commodity hardware. Modeled after Google’s Google File System and MapReduce.
  • #8: Hadoop cluster at Yahoo
  • #9: Health care, telecom, retail, tech, media, government,
  • #10: This is Cloudera’s view of the Hadoop ecosystem. Of course, other Hadoop distributions or platform distributions may have other views of the system (Confluent, Databricks, DataStax).
  • #11: I am not talking about big data applications, but about the big data management problem.
  • #13: Using commodity hardware means failure is the norm, and it must be handled. HDFS is a distributed file system with high throughput. Traditionally a replicated block architecture.
  • #14: As a cluster grows, there is more cold data. This cold data doesn’t need high throughput; it just needs redundancy.
  • #16: http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
  • #17: Traditionally replicated block architecture
  • #19: EC encoding/decoding benefits from hardware acceleration: ISA-L gives a 4x speedup. Intel folks have contributed significantly to this project. Compare Facebook’s HDFS-RAID coder, the pure Java coder, and the ISA-L coder.
  • #21: Think of an operating system kernel scheduler that runs multiple processes on multiple cores. YARN-2915 https://issues.apache.org/jira/browse/YARN-2915
  • #22: Think of an operating system kernel scheduler that runs multiple processes on multiple cores. YARN-2915 https://issues.apache.org/jira/browse/YARN-2915
  • #25: http://www.slideshare.net/NaganarasimhaGarla/application-timeline-server-past-present-and-future-53027238
  • #26: Each application has its own JHS. Information is not shared between JHS instances. It is hard to trace an Oozie workflow that, say, starts a Spark job, runs an HDFS command, and starts a YARN app. There is very little visibility into the global picture. Metrics are stored in HDFS and can only be retrieved when the job terminates.
  • #29: Single point of failure. HDFS-6440, contributed by Salesforce.
  • #31: Don’t go beyond 3 to 5 NameNodes in practice.
  • #39: S3N will not be maintained other than for security fixes.
  • #41: S3N will not be maintained other than for security fixes.
  • #44: Spend 10 minutes. Hadoop was based on the assumption of moving compute to data. 3D XPoint memory technology. Virtualized compute, storage, and networks.
  • #45: Hortonworks
  • #46: Microsoft
  • #47: Ask bigger questions