The Family of Hadoop
            Nham Xuan Nam
     nhamxuannam [at] gmail.com
     http://namnham.blogspot.com




    Barcamp Saigon, December 13 2009
Content
   History
   Sub-projects
   HDFS
   Map Reduce
   HBase
   Hive
History
   Hadoop was created by Doug Cutting, the creator of
    Lucene.
   Lucene: an open source indexing and search library.
   Nutch: a Lucene-based web crawler.
   Jun 2003, a 100-million-page Nutch demo system
    ran successfully.
   Nutch problem: its architecture could not
    scale to billions of pages.
History
   Oct 2003, Google published the paper
    “The Google File System”.
   In 2004, the Nutch team wrote an open source implementation
    of GFS, called the Nutch Distributed File System (NDFS).
   Dec 2004, Google published the paper “MapReduce:
    Simplified Data Processing on Large Clusters”.
   In 2005, the Nutch team implemented MapReduce in Nutch.
   Mid 2005, all the major Nutch algorithms had been ported
    to run on MapReduce and NDFS.
History
   Feb 2006, Nutch's NDFS and MapReduce
    implementations were moved out to form the Hadoop project.
   Doug Cutting joined Yahoo!.
   Jan 2008, Hadoop became an Apache top-level
    project.
   Feb 2008, Yahoo!'s production search index
    was generated by a 10,000-core Hadoop
    cluster.
History




Source: http://wiki.apache.org/hadoop/PoweredBy
Sub-projects
The Family of Hadoop: HDFS
Architecture
Data Model
   Files are stored as blocks (default size: 64 MB)
   Reliability through replication
    – Each block is replicated to several datanodes
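
The block and replication model above can be sketched in a few lines of plain Python. This is only an illustration, not HDFS code: the replication factor of 3 is an assumption (the slide only says "several"), the datanode names are made up, and round-robin placement stands in for the real placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated above
REPLICATION = 3                # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block of a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Round-robin is a simplification chosen for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(200 * 1024 * 1024)    # a 200 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), blocks[-1] // (1024 * 1024))  # 4 8  (4 blocks, last one 8 MB)
print(placement[0])                              # ['dn1', 'dn2', 'dn3']
```

Only the last block is shorter than the block size, and losing any single datanode still leaves two copies of every block.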
Namenode & Datanodes
   Namenode (master)
    – manages the filesystem namespace
    – maintains the filesystem tree and metadata for all the
      files and directories in the tree.

   Datanodes (slaves)
    – store data in the local file system
    – Periodically report back to the namenode with lists of all
      existing blocks

   Clients communicate with both namenode and datanodes.
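
A toy model of the split of responsibilities above, in plain Python. The class and method names are hypothetical and chosen for this sketch; they are not the Hadoop API. The namenode keeps only metadata, datanodes keep the bytes, and a client talks to both.

```python
class DataNode:
    """Stores block contents locally and can report which blocks it holds."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block id -> bytes

    def block_report(self):
        return sorted(self.blocks)


class NameNode:
    """Holds only metadata: the namespace and the block locations."""

    def __init__(self):
        self.namespace = {}       # file path -> ordered list of block ids
        self.locations = {}       # block id -> set of datanode names

    def receive_report(self, datanode):
        for block_id in datanode.block_report():
            self.locations.setdefault(block_id, set()).add(datanode.name)

    def get_block_locations(self, path):
        return [(b, sorted(self.locations.get(b, ()))) for b in self.namespace[path]]


def read_file(path, namenode, datanodes):
    """Client read: ask the namenode for locations, then fetch from datanodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].blocks[block_id]
    return data


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks = {"blk_1": b"Hello ", "blk_2": b"Barcamp"}
dn2.blocks = {"blk_2": b"Barcamp"}

nn = NameNode()
nn.namespace["/barcamp/hello.txt"] = ["blk_1", "blk_2"]
for dn in (dn1, dn2):
    nn.receive_report(dn)

locations = nn.get_block_locations("/barcamp/hello.txt")
content = read_file("/barcamp/hello.txt", nn, {"dn1": dn1, "dn2": dn2})
print(locations)  # [('blk_1', ['dn1']), ('blk_2', ['dn1', 'dn2'])]
print(content)    # b'Hello Barcamp'
```

Note that file data never passes through the namenode in this sketch; it only answers the "where are the blocks?" question.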
Data Flow
Accessibility
   FileSystem Java API
    – org.apache.hadoop.fs.*

   Web Interface

   Commands for HDFS users
$ hadoop dfs -mkdir /barcamp

$ hadoop dfs -ls /barcamp

   Commands for HDFS admins
$ hadoop dfsadmin -report

$ hadoop dfsadmin -refreshNodes
The Family of Hadoop: MapReduce
Programming Model
   Data is a stream of keys and values
   Map
    – Input: <key1, value1> pairs from the data source
    – Output: intermediate <key2, value2> pairs

   Reduce
    – Called once per key, in sorted key order
    – Input: <key2, list of value2>
    – Output: <key3, value3> pairs
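
The model can be simulated in memory with a small driver. This is a sketch in plain Python, not Hadoop code: `run_mapreduce` and the city/reading job are hypothetical names invented for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a map phase, group values by key, then reduce in sorted key order."""
    groups = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output


# Hypothetical job: highest reading per city.
def mapper(_line_no, line):
    city, reading = line.split(",")
    yield city, int(reading)


def reducer(city, readings):
    yield city, max(readings)


records = enumerate(["Saigon,31", "Hanoi,24", "Saigon,34", "Hanoi,27"])
result = run_mapreduce(records, mapper, reducer)
print(result)  # [('Hanoi', 27), ('Saigon', 34)]
```

The mapper sees one <key1, value1> pair at a time, and the reducer sees each key2 once, together with the full list of its values.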
Data Flow
WordCount Example
 File01:                                  File02:
 Hello Barcamp Hello Everyone             Hello Hadoop Hello Everyone

 Map input:
 <_, Hello Barcamp Hello Everyone>        <_, Hello Hadoop Hello Everyone>

 Map output (counts combined per file):
 <Hello,    2>                            <Hello,    2>
 <Barcamp,  1>                            <Hadoop,   1>
 <Everyone, 1>                            <Everyone, 1>

 Reduce input (grouped and sorted by key):
 <Barcamp,  [1]>
 <Everyone, [1,1]>
 <Hadoop,   [1]>
 <Hello,    [2,2]>

 Reduce output:
 <Barcamp,  1>
 <Everyone, 2>
 <Hadoop,   1>
 <Hello,    4>
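
The same WordCount run can be reproduced stage by stage in plain Python. This is a sketch, not the Hadoop WordCount program. It assumes a combiner sums the counts inside each map task, which is what the per-file <Hello, 2> pairs in the example imply.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_words(text):
    """Map: emit <word, 1> for every word in the input line."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum counts locally, per map task, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return dict(local)


# One map task per file, each followed by a local combine.
map_outputs = {name: combine(map_words(text)) for name, text in files.items()}

# Shuffle: group the combined counts by word across all map tasks.
shuffled = defaultdict(list)
for name in sorted(map_outputs):
    for word, count in map_outputs[name].items():
        shuffled[word].append(count)

# Reduce: sum each word's partial counts, visiting keys in sorted order.
result = {word: sum(shuffled[word]) for word in sorted(shuffled)}

print(map_outputs["File01"])  # {'Hello': 2, 'Barcamp': 1, 'Everyone': 1}
print(shuffled["Hello"])      # [2, 2]
print(result)                 # {'Barcamp': 1, 'Everyone': 2, 'Hadoop': 1, 'Hello': 4}
```

Without the combiner, the map tasks would emit <Hello, 1> twice each and the reducer would receive [1, 1, 1, 1]; the final counts would be the same.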
MapReduce in Hadoop
   JobTracker (master)
    – handles all jobs.
    – schedules tasks on the slaves.
    – monitors tasks and re-executes failed ones.

   TaskTrackers (slaves)
    – execute the tasks.

   Task
    – runs an individual map or reduce.
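
A rough simulation of the scheduling and re-execution roles above, in plain Python. Everything here is hypothetical: `run_job`, the task and tracker names, the round-robin policy, and the injected failure are assumptions made for the sketch, not how the JobTracker is actually implemented.

```python
from collections import deque


def run_job(tasks, trackers, fails_first_attempt=frozenset()):
    """Schedule tasks round-robin on trackers and re-execute any that fail.

    tasks: task ids such as "map-0"; trackers: tasktracker names.
    fails_first_attempt: task ids whose first attempt fails (simulated).
    Returns a log of (task, tracker, attempt, status) tuples.
    """
    queue = deque((task, 1) for task in tasks)
    log = []
    slot = 0
    while queue:
        task, attempt = queue.popleft()
        tracker = trackers[slot % len(trackers)]
        slot += 1
        if attempt == 1 and task in fails_first_attempt:
            log.append((task, tracker, attempt, "FAILED"))
            queue.append((task, attempt + 1))   # monitoring: queue a retry
        else:
            log.append((task, tracker, attempt, "SUCCEEDED"))
    return log


log = run_job(["map-0", "map-1", "map-2"], ["tt1", "tt2", "tt3"], {"map-1"})
for entry in log:
    print(entry)
# ('map-0', 'tt1', 1, 'SUCCEEDED')
# ('map-1', 'tt2', 1, 'FAILED')
# ('map-2', 'tt3', 1, 'SUCCEEDED')
# ('map-1', 'tt1', 2, 'SUCCEEDED')
```

The failed task is simply put back in the queue and handed to the next available tracker, so the job completes without the client resubmitting anything.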
The Family of Hadoop: HBase
Introduction
   Nov 2006, Google released the paper “Bigtable: A
    Distributed Storage System for Structured Data”.
   Bigtable: a distributed, column-oriented store built on top of
    the Google File System.
   HBase: an open source implementation of Bigtable, built on
    top of HDFS.
Data Model
   Data are stored in tables of rows and columns.
   Cells are “versioned”.
    → Data are addressed by row/column/version key.
   Table rows are sorted by row key, the table's primary key.
   Columns are grouped into column families.
    → A column name has the form “<family>:<label>”.
   Tables are stored in regions.
   Region: a row range [start-key : end-key)
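
The addressing scheme above can be modelled with nested dictionaries. This is a plain-Python sketch, not the HBase client API: the `Table` class, the `info` family, and the integer timestamps used as version keys are all assumptions for illustration.

```python
class Table:
    """Toy table: cells addressed by (row key, "family:label", version)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}   # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version, or the newest one at or before timestamp."""
        versions = self.rows[row][column]
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)]

    def scan(self, start_key, end_key):
        """Row keys in the half-open range [start_key, end_key), in sorted order."""
        return [r for r in sorted(self.rows) if start_key <= r < end_key]


t = Table(families=["info"])
t.put("user2", "info:city", 1, "Hanoi")
t.put("user1", "info:city", 1, "Saigon")
t.put("user1", "info:city", 2, "Da Nang")
t.put("user3", "info:city", 1, "Hue")

print(t.get("user1", "info:city"))      # Da Nang
print(t.get("user1", "info:city", 1))   # Saigon
print(t.scan("user1", "user3"))         # ['user1', 'user2']
```

The `scan` range is half-open, like a region: the start key is included and the end key is not.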
Data Model
Architecture
   Master Server
    – assigns regions to regionservers
    – monitors the health of regionservers
    – handles administrative functions

   RegionServers
    – contain regions and handle client read/write requests

   Catalog Tables (-ROOT- and .META.)
    – maintain the current list, state, recent history, and
      location of all regions.
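
How a client might use catalog information to find the regionserver for a row can be sketched as a sorted lookup. This is plain Python with hypothetical names (`Catalog`, `assign`, `locate`, `rs1`...); the real catalog tables hold more state than this, and the sketch assumes the first region starts at the empty key.

```python
import bisect


class Catalog:
    """Toy catalog: maps each region's start key to the regionserver hosting it."""

    def __init__(self):
        self.start_keys = []   # sorted region start keys; "" is the first region
        self.servers = {}      # start key -> regionserver name

    def assign(self, start_key, server):
        """Master side: record which regionserver hosts a region."""
        if start_key not in self.servers:
            bisect.insort(self.start_keys, start_key)
        self.servers[start_key] = server

    def locate(self, row_key):
        """Client side: find the region whose row range contains row_key."""
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        start_key = self.start_keys[index]
        return start_key, self.servers[start_key]


catalog = Catalog()
catalog.assign("", "rs1")    # region ["", "m")
catalog.assign("m", "rs2")   # region ["m", "t")
catalog.assign("t", "rs3")   # region ["t", end of table)

print(catalog.locate("barcamp"))  # ('', 'rs1')
print(catalog.locate("m"))        # ('m', 'rs2')
print(catalog.locate("zebra"))    # ('t', 'rs3')

catalog.assign("m", "rs1")        # the master reassigns a region
print(catalog.locate("nutch"))    # ('m', 'rs1')
```

Because rows are sorted by key and regions are contiguous ranges, a binary search over the start keys is enough to route any read or write.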
Accessibility
   Client API
    – org.apache.hadoop.hbase.client.*

   HBase Shell
$ bin/hbase shell
hbase>

   Web Interface
The Family of Hadoop: Hive
Introduction
   Started at Facebook
   An open source data warehousing solution
    built on top of Hadoop
   For managing and querying structured data
   Hive QL: an SQL-like query language
    – compiled into map-reduce jobs
   Uses: log processing, data mining, ...
Data Model
   Tables
    – analogous to tables in RDBMS
    – rows are organized into typed columns
    – all the data is stored in a directory in HDFS

   Partitions
    – determine the distribution of data within sub-directories
      of the table directory

   Buckets
    – based on the hash of a column in the table
    – Each bucket is stored as a file in the partition directory
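
The table / partition / bucket layout above maps naturally onto a path. The sketch below is plain Python with several assumptions: the warehouse root, the `page_views` table, the `ds` and `country` partition columns, the bucket file name pattern, and the use of the integer value itself as its hash are illustrative choices, not taken from the slides.

```python
WAREHOUSE = "/user/hive/warehouse"   # assumed warehouse root directory


def bucket_for(value, num_buckets):
    """Pick a bucket from an integer column value (hash assumed to be the value)."""
    return value % num_buckets


def storage_path(table, partition, bucket_value, num_buckets):
    """Build the path for one row: table dir / partition sub-dirs / bucket file."""
    partition_dirs = "/".join(f"{col}={val}" for col, val in partition)
    bucket = bucket_for(bucket_value, num_buckets)
    # The bucket file name pattern is hypothetical.
    return f"{WAREHOUSE}/{table}/{partition_dirs}/bucket_{bucket:05d}"


path = storage_path(
    table="page_views",                                   # hypothetical table
    partition=[("ds", "2009-12-13"), ("country", "VN")],  # hypothetical partition
    bucket_value=1234,                                    # hypothetical user id
    num_buckets=32,
)
print(path)
# /user/hive/warehouse/page_views/ds=2009-12-13/country=VN/bucket_00018
```

A query that filters on the partition columns only has to read the matching sub-directories, and rows with the same bucketing column value always land in the same file.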
Architecture
   Metastore
    – contains metadata about data stored in Hive.
    – stored in any SQL backend or an embedded Derby database.
    – Database: a namespace for tables
    – Table metadata: column types, physical layout,...
    – Partition metadata

   Compiler

   Execution Engine

   Shell
Hive Query Language
   Data Definition (DDL) statements
    – CREATE/DROP/ALTER TABLE
    – SHOW TABLES/PARTITIONS

   Data Manipulation (DML) statements
    – LOAD DATA
    – INSERT
    – SELECT

   User-defined functions: UDF/UDAF
Hive @ Facebook
The End




Thank you!
