Johan Oskarsson

   Developer at Last.fm
Hadoop and Hive committer
What is HDFS?

 Hadoop Distributed FileSystem
 Two server types
     Namenode - keeps track of block locations
     Datanode - stores blocks
 Files commonly split up into 128 MB blocks
 Replicated to 3 datanodes by default
 Scales well: ~4000 nodes
 Write once
 Large files
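
To make the block and replication figures above concrete, here is a minimal sketch of the arithmetic, assuming the 128 MB block size and replication factor of 3 quoted on the slide. The function names are made up for illustration and are not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the common block size from the slide
REPLICATION = 3                 # default replication factor from the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file of the given size is split into."""
    # Integer ceiling division; an empty file has no blocks.
    return (file_size_bytes + block_size - 1) // block_size


def raw_bytes_used(file_size_bytes, replication=REPLICATION):
    """Raw disk space consumed across all datanodes for one file."""
    # The final block only occupies as much disk as the data it holds.
    return file_size_bytes * replication


one_gib = 1024 ** 3
print(block_count(one_gib))     # 8 blocks of 128 MB
print(raw_bytes_used(one_gib))  # three copies of every byte
```

A 1 GB file therefore becomes 8 blocks, each stored on 3 datanodes, and the namenode tracks where all 24 replicas live.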
"Can you use HDFS in
    production?"
Yes

We have used it in production since
2006, but then again we are insane.
Who is using HDFS in production?

  Yahoo! Largest cluster 4000 nodes (14PB raw storage)
  Facebook. 600 nodes (2PB raw storage)

  Powerset (Microsoft). "up to 400 instances"

  Last.fm. 31 nodes (110TB raw storage)

  ... see more at http://wiki.apache.org/hadoop/PoweredBy
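
The cluster figures above are easier to compare per node. The sketch below is a rough back-of-the-envelope calculation, assuming 1 PB = 1000 TB and that every file is stored at the default replication factor of 3 (neither assumption is stated on the slide). Powerset is left out because no storage figure is given for it.

```python
# (nodes, raw storage in TB) as quoted on the slide.
# Assumption: 1 PB is taken as 1000 TB.
CLUSTERS = {
    "Yahoo!": (4000, 14_000),
    "Facebook": (600, 2_000),
    "Last.fm": (31, 110),
}
REPLICATION = 3  # assumption: all data kept at the default replication


def summarize(clusters, replication=REPLICATION):
    """Return raw TB per node and approximate usable TB for each cluster."""
    rows = {}
    for name, (nodes, raw_tb) in clusters.items():
        rows[name] = {
            "raw_tb_per_node": raw_tb / nodes,
            "usable_tb": raw_tb / replication,
        }
    return rows


for name, row in summarize(CLUSTERS).items():
    print(f"{name}: {row['raw_tb_per_node']:.2f} TB raw per node, "
          f"~{row['usable_tb']:.0f} TB usable")
```

Under these assumptions all three clusters come out at roughly 3.3 to 3.6 TB of raw disk per node, despite the large difference in cluster size.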
What do they use Hadoop for?

  Yahoo! search index, Yahoo! anti spam, etc

  Facebook ad, profile and application monitoring, etc

  Powerset search index, heavy HBase users

  Last.fm charts, A/B testing stats, site metrics and reporting
"Does HDFS meet people's
needs? If not, what can we do?"
Use case - MR batch jobs

Scenario
1. Large source data files are inserted into HDFS
2. MapReduce job is run
3. Output is saved to HDFS

   HDFS is a great choice for this use case
   Shorter downtime is acceptable
   Backups for important data
   Permissions + trash to avoid user error
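
The three scenario steps above can be sketched as a toy, in-memory map/shuffle/reduce pipeline. This is only an illustration of the batch pattern, not Hadoop's API: the function names, the tab-separated input format and the sample listening data are all hypothetical, and a real job would read its input from HDFS and write its output back to HDFS.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def play_mapper(line):
    # Hypothetical input format: "<user>\t<artist>", one line per play.
    _user, artist = line.split("\t")
    yield artist, 1


def count_reducer(_artist, counts):
    return sum(counts)


# Step 1: source data (stands in for large files stored in HDFS).
source_lines = ["alice\tRadiohead", "bob\tRadiohead", "alice\tBjork"]
# Step 2: run the job.
chart = reduce_phase(shuffle(map_phase(source_lines, play_mapper)), count_reducer)
# Step 3: the result (would be saved back to HDFS).
print(chart)
```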
Use case - Serving files to a website

Scenario
1. User visits a website to browse photos
2. Lots of image files are requested from HDFS

Potential issues and solutions
   HDFS isn't written for many small files
       Namenode RAM limits number of files
       Use HBase or similar
   Namenode goes down
       Crazy "double cluster" solution
       Standby namenode HADOOP-4539
   HDFS isn't really written for low response times
       Work is being done, not high priority
   Use GlusterFS or MogileFS instead
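
To see why namenode RAM limits the number of files, here is a rough estimate of the namenode memory needed for the same amount of data stored as small image files versus block-sized files. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and an assumption here, not a number from the slides; the 100 KB image size is also hypothetical, and directories are ignored.

```python
BYTES_PER_OBJECT = 150           # assumption: rough rule of thumb per file or block
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, as on the "What is HDFS?" slide


def namenode_heap_bytes(num_files, avg_file_size,
                        block_size=BLOCK_SIZE,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate namenode memory for num_files files of avg_file_size bytes."""
    # Every non-empty file needs at least one block, however small it is.
    blocks_per_file = max(1, (avg_file_size + block_size - 1) // block_size)
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * bytes_per_object


TOTAL = 1024 ** 4        # 1 TB of data in both cases
IMAGE_SIZE = 100 * 1024  # hypothetical 100 KB photo

small = namenode_heap_bytes(TOTAL // IMAGE_SIZE, IMAGE_SIZE)
large = namenode_heap_bytes(TOTAL // BLOCK_SIZE, BLOCK_SIZE)

print(f"small files: {small / 1024 ** 2:.1f} MiB of namenode heap")
print(f"large files: {large / 1024 ** 2:.1f} MiB of namenode heap")
```

Under these assumptions the same terabyte costs the namenode about three orders of magnitude more memory when stored as small files, which is the reason the slide points to HBase or a different filesystem for this use case.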
Use case - Reliable, realtime log storage

Scenario
1. A stream of logging events is generated
2. The stream is written directly to HDFS

Potential issues and solutions
   Problems with long write sessions
       HDFS-200, HADOOP-6099, HDFS-278
   Namenode goes down
       Crazy "double cluster" solution
       Standby namenode HADOOP-4539
   Appends not stable
       HDFS-265
Potential dealbreakers

  Small files problem™
     Use archives, SequenceFiles or HBase
  Appends/sync not stable
  Namenode not highly available
  Relatively high latency reads
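
The idea behind the archive and SequenceFile workarounds for the small files problem is to pack many small files into one large container so that the namenode tracks a single file. The sketch below shows only that general idea with an in-memory blob and an offset index; it is not the on-disk format of Hadoop archives or SequenceFiles, and the function names are hypothetical.

```python
import io


def pack(files):
    """Concatenate small files into one blob; return (blob, index).

    index maps each name to its (offset, length) inside the blob.
    """
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index


def read(blob, index, name):
    """Fetch one original file back out of the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


small_files = {"a.jpg": b"\x01\x02\x03", "b.jpg": b"\x04\x05", "c.jpg": b""}
blob, index = pack(small_files)
print(len(blob), index["b.jpg"])
```

The trade-off is that reading a single item now requires an index lookup, and the container has to be rewritten to change its contents, which fits the write-once model.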
Improvements

In progress or completed
    HADOOP-4539 - Streaming edits to a standby NN
    HDFS-265 - Appends
    HDFS-245 - Symbolic links


Wish list
   HDFS-209 - Tool to edit namenode metadata files
   HDFS-220 - Transparent data archiving off HDFS
   HDFS-503 - Reduce disk space used with erasure coding
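
To illustrate why erasure coding reduces disk usage compared with replication, the sketch below compares storage overhead and rebuilds a lost block from single XOR parity. XOR parity is the simplest erasure code and is used here only as an illustration; it is not necessarily the scheme proposed in HDFS-503, and the block contents and function names are made up.

```python
from functools import reduce


def storage_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + parity_units) / data_units


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Replication factor 3 is one data copy plus two extra copies.
print(storage_overhead(1, 2))   # 3.0
# Three data blocks protected by one parity block.
print(storage_overhead(3, 1))

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Lose the second block, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])       # True
```

Single parity only survives the loss of one block per group, whereas 3x replication survives two lost copies, so a practical scheme would need more parity blocks to match that durability.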
Competitors

  Hadoop MapReduce compatible
    CloudStore - http://kosmosfs.sourceforge.net/

  Low response time
     MogileFS - http://www.danga.com/mogilefs/
     GlusterFS - http://www.gluster.org/
