Apache Hadoop 0.22
and Other Versions
Konstantin V Shvachko
Principal Hadoop Architect, eBay
IBM, Karmasphere, Twitter
February – March, 2012
eBay Inc. confidential
Apache Hadoop Ecosystem
• Hadoop Core
– Common – communication and user facing APIs
– HDFS – distributed file system
– MapReduce – distributed computation framework
• Pig – dataflow language
• Hive – data warehouse, SQL
• Zookeeper – distributed coordination service
• HBase – columnar store
• Oozie – complex job workflow
• eBay Specific
– Cascading
– Lzo compression
Hadoop Versioning
• Straight line from 0.1 to 0.20
• Fanned out starting from 0.20.2
• Multiple distributions in 2010 based on 0.20
– Apache, Y, CDH, FB
– More today
• Focus on Apache Releases
– Release 0.20.2 2010-02-16
– Release 0.21.0 2010-08-13
– Release 0.20.203.0 2011-05-11 Security Stable
– Release 0.20.204.0 2011-09-05 Improvements
– Release 0.20.205.0 2011-10-17 HBase support
• Genealogy of elephants
Major Branches
• Hadoop 1.0.0 (security branch) 2011-12-27
– Rename of 0.20.205
– Beta
• Hadoop 0.22.0 2011-12-10
– Continuation of 0.21.0
– Beta
• Hadoop 0.23.0 2011-11-11
– Federation – static partitioning of HDFS namespace
– Yarn – new implementation of MapReduce
– Scalability
– Alpha
• 2011 – record number of major releases!
• No unifying release, containing all the good features
Hadoop 0.22 Branch
• Branched 2010-11-17
• Released 2011-12-10
• Many events in-between
• RM role – started in August 2011
• Stabilization
– Hadoop Platform team, eBay
– Many contributors from the community
Features HDFS - 0.22
• New implementation of file append
• HBase support with hflush and hsync
• Symbolic links
• BackupNode and CheckpointNode
• DataNodes tolerate single disk failure. Disk-fail-in-place
• File concatenation
• SLive test
• Sticky bit
• Offline Image Viewer
Features MapReduce - 0.22
• Hierarchical job queues
• Job limits per queue / pool
• Dynamically stop / start job queues
• Advances in new MapReduce API
– Input/Output formats, ChainMapper / ChainReducer
• TaskTracker blacklisting
• DistributedCache sharing
Features not Supported in Hadoop 0.22.0
Compared to Hadoop 1.0
• Security
– LinuxTaskController removed MAPREDUCE-2767
• Optimizations (operability) of the MapReduce framework
introduced in the Hadoop 0.20.security line of releases
– Limits on per-job JobConf, Counters, StatusReport, Split-Sizes
– User / queue limits on tasks / jobs in the CapacityScheduler
• Disk-fail-in-place – MapReduce part
• JMX-based metrics v2
• Jetty workaround
• CapacityScheduler should assign multiple tasks per heartbeat
• User's task logs filling up local disks on the TaskTrackers
• FairScheduler back-port from trunk
Not in Hadoop 0.22.0 HDFS Part
• Short-circuit local reads: a local client reads DataNode files directly
– Important HBase optimization
– Porting is in progress
• WebHDFS: accessing HDFS over HTTP
– New experimental feature, back-ported from trunk
• NameNode startup time
– Handling block reports and missed heartbeats from DataNodes
– The rest is forward ported from 1.0
– More startup improvements in 0.22
Hadoop 0.23 Features
• HDFS Federation
– Independent NameNodes sharing a common pool of DataNodes
– Cluster is a family of volumes with shared block storage layer
– User sees volumes as isolated file systems
– ViewFS: the client-side mount table
– Federated approach provides a static partitioning of the federated namespace
• Yarn: Scalability for MapReduce framework
– Separation of JobTracker functions
1. Job scheduling and resource allocation:
• Fundamentally centralized
2. Job monitoring and job life-cycle coordination
• Delegate coordination of different jobs to other nodes
– Dynamic partitioning of cluster resources: no fixed slots
• “Apache Hadoop: The scalability update” USENIX ;login: June, 2011
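
The client-side mount table listed under HDFS Federation can be pictured with a toy path resolver. This is a minimal sketch, not the ViewFS implementation: the mount points, the NameNode host names, and the `resolve` function are all hypothetical and chosen only for illustration.

```python
# Toy model of a client-side mount table (not the real ViewFS code).
# Mount points and NameNode URIs below are hypothetical examples.
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com:8020/user",
    "/data": "hdfs://nn2.example.com:8020/data",
    "/data/logs": "hdfs://nn3.example.com:8020/logs",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Map a client-visible path to a URI on the NameNode that owns it.

    The longest matching mount point wins, and the match must end on a
    path-component boundary so that /database does not match /data.
    """
    best = None
    for mount in mount_table:
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError("no mount point covers " + path)
    return mount_table[best] + path[len(best):]
```

In this sketch the partitioning is static in the sense the slide describes: a subtree moves to another NameNode only when someone edits the table.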
Append and HBase
• Append means
– Reopening of existing files for appending new data
– Replica synchronization after failure
– Consistent view of file data during writing by different clients
– hflush, hsync – guarantee data delivered to DNs and persisted on NN
• First implementation of append in 0.19 HADOOP-1700
– 0.20-append branch
• Redesign of append in 0.21 HDFS-265
• HBase needs hflush and hsync only
• Hadoop 1.0 - HBase support via hflush, hsync
• Hadoop 0.22 – fully functional append, including HBase support
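
Why a write-ahead log such as HBase's depends on hflush can be seen in a toy model. The sketch below is an assumption-laden simplification, not HDFS code: `ToyPipelineWriter` and `client_crash` are hypothetical names, replicas are plain byte strings, and only the "delivered to DNs" part of the guarantee is modeled.

```python
class ToyPipelineWriter:
    """Toy model of a writer with a client-side buffer and N replicas."""

    def __init__(self, replicas=3):
        self.buffer = b""                        # bytes not yet sent
        self.replicas = [b"" for _ in range(replicas)]

    def write(self, data):
        self.buffer += data                      # stays on the client

    def hflush(self):
        # Deliver buffered bytes to every replica, then clear the buffer.
        self.replicas = [r + self.buffer for r in self.replicas]
        self.buffer = b""

    def visible(self):
        # What a new reader could see: only data that reached the replicas.
        return self.replicas[0]


def client_crash(writer):
    """Simulate losing the client: unflushed bytes are gone."""
    writer.buffer = b""
    return writer.visible()
```

In this model, an edit written and flushed before the crash survives on all replicas, while an edit that was only written is lost.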
BackupNode
• BackupNode is a read-only NameNode
– Contains all file system metadata: files and directories
excluding block locations
– Can perform NameNode operations that don’t modify namespace
• BN maintains up-to-date in-memory image of file system namespace
always synchronized with the NameNode state
– NameNode streams journal to BackupNode
• BackupNode can create a checkpoint without downloading
checkpoint and journal files from active NameNode
• Intended to evolve into hot HA HDFS-2064
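
The journal-streaming idea on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not the BackupNode code: the namespace is reduced to a set of paths, the journal has only hypothetical `create` and `delete` records, and the class names are invented for illustration.

```python
import json


def apply_op(namespace, op):
    """Apply one journal record to an in-memory namespace (a set of paths)."""
    kind, path = op
    if kind == "create":
        namespace.add(path)
    elif kind == "delete":
        namespace.discard(path)
    else:
        raise ValueError("unknown op: " + kind)


class ToyBackupNode:
    def __init__(self):
        self.namespace = set()

    def receive(self, op):
        # Replay each streamed record so the image stays in sync.
        apply_op(self.namespace, op)

    def checkpoint(self):
        # Built from the local in-memory image; nothing is downloaded.
        return json.dumps(sorted(self.namespace))


class ToyNameNode:
    def __init__(self, backup):
        self.namespace = set()
        self.backup = backup

    def modify(self, kind, path):
        op = (kind, path)
        apply_op(self.namespace, op)
        self.backup.receive(op)    # stream the journal record
```

Because the backup replays every record as it arrives, its checkpoint reflects the same namespace as the active node without fetching image or journal files.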
Hadoop at eBay
• 2011 started with 532-node 5 PB cluster running CDH2
• eBay 0.20.203-based build (Wilma)
– Hadoop 0.20.203 – latest stable Apache release
• HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo
– 500+ users; 2000 jobs per day
• Runs on 1000-node cluster
– 24 PB – capacity, 72 GB RAM / node
• Many smaller clusters
• Stabilization of Hadoop platform based on 0.22
Testing
• One year of testing by different groups in Hadoop ecosystem
• Extensive testing of append by HBase community
• Fully automated build and certification with BigTop
• Hadoop platform team at eBay
– Extensive stabilization effort starting September
– Most bugs found in 0.22 are also in trunk and 0.23
– All new features tested
– Stress testing
– Reliability testing
• Works with:
– Pig 0.8, Hive 0.7, HBase 0.92, Oozie: custom changes, open sourced
– Zookeeper, Cascading: no changes needed
Testing Tools, Examples
• TeraSort, TestDFSIO, DistCp
• GridMix, Rumen – production job traces
• SLive – adjustable mix of HDFS operations, permanent load
• Upgrade / rollback from 0.20.? and 0.20.203 to 0.22
• Oversubscribed cluster running out of memory
• Losing racks with running jobs and HBase
– Cluster survived consecutive loss of 4 racks, shrinking to single rack
with HBase still alive and MR jobs completing
• Disk-fail-in-place helps identify bad drives during hardware burn-in
Benchmarking
• TestDFSIO: 10 GB files (same as 100 GB)
• TeraSort: -5% (scheduler to blame)
• YCSB - same
• Internal eBay applications – same or better
• Lots of tuning: Hadoop, Java, OS, HW
– Gradual improvement of results
Throughput, MB/sec
              Read   Write   Append
Hadoop-0.22   100    84      83
0.20 breed    96     66      n/a
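
The throughput figures above can be turned into relative gains with a little arithmetic. The snippet is only a worked calculation on the numbers from the slide; the variable and function names are illustrative.

```python
# Throughput in MB/sec, copied from the figures above.
hadoop_022 = {"read": 100, "write": 84, "append": 83}
breed_020 = {"read": 96, "write": 66}    # append: n/a


def relative_gain(new, old):
    """Percentage change of new over old, rounded to one decimal."""
    return round(100.0 * (new - old) / old, 1)


# Compare only the operations that both lines report.
gains = {op: relative_gain(hadoop_022[op], breed_020[op]) for op in breed_020}
```

Here `gains` comes out as roughly +4.2% for reads and +27.3% for writes. Append has no 0.20 baseline to compare against.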
Good to have for 0.22.1
• Restore Security
• Disk Fail in place for MapReduce
• Optimizations
– Multiple tasks per heartbeat for CapacityScheduler
– CapacityScheduler preemption
• MR job and task limits
• Cluster startup time
• Add HA?
• Merge MR-1.0 into Hadoop 0.22?
Important
• Works but not 0.20
– Good new features
– Reliability is the first concern
– Performance and missing functionality can be reconstructed
• Community release
– Not distributed / advertised by commercial distributors
– Community involvement important
• Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0
It’s the other way around
– Go to Hadoop 0.22 instead
• Forward-going release progress
– Stop porting new features, start releasing them
Thank you
Hadoop 0.22 Contributions Accepted
