The Family of Hadoop

Nham Xuan Nam
nhamxuannam [at] gmail.com
http://namnham.blogspot.com

Barcamp Saigon, December 13, 2009

Content
• History
• Sub-projects
• HDFS
• MapReduce
• HBase
• Hive

History
• Hadoop was created by Doug Cutting, the creator of Lucene.
• Lucene: an open source indexing and search library.
• Nutch: a Lucene-based web crawler.
• Jun 2003: a 100-million-page Nutch demo system ran successfully.
• Nutch's problem: its architecture could not scale to billions of pages.

History
• Oct 2003: Google published the paper “The Google File System”.
• 2004: the Nutch team wrote an open source implementation of GFS, called the Nutch Distributed File System (NDFS).
• Dec 2004: Google published the paper “MapReduce: Simplified Data Processing on Large Clusters”.
• 2005: the Nutch team implemented MapReduce in Nutch.
• Mid 2005: all the major Nutch algorithms had been ported to run using MapReduce and NDFS.

History
• Feb 2006: Nutch's NDFS and MapReduce implementation were split out to form the Hadoop project.
• Doug Cutting joined Yahoo!.
• Jan 2008: Hadoop became an Apache top-level project.
• Feb 2008: Yahoo!'s production search index was generated by a 10,000-core Hadoop cluster.

History

Source: http://wiki.apache.org/hadoop/PoweredBy

Data Model
• Files are stored as blocks (default block size: 64 MB)
• Reliability through replication
– Each block is replicated to several datanodes
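
To make the block and replication idea concrete, here is a minimal plain-Python sketch. It is not Hadoop code. The replication factor of 3, the datanode names, and the round-robin placement are assumptions for illustration only; real HDFS placement also takes racks into account.

```python
# Toy model of HDFS block splitting and replica placement (not Hadoop code).
BLOCK_SIZE = 64 * 1024 * 1024  # default block size from the slide: 64 MB
REPLICATION = 3                # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be shorter
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# A 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # 4 blocks: three of 64 MB and one of 8 MB
print(placement)
```

Losing any single datanode in this sketch still leaves two copies of every block, which is the point of the replication bullet above.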

Namenode & Datanodes
• Namenode (master)
– manages the filesystem namespace
– maintains the filesystem tree and the metadata for all the files and directories in the tree

• Datanodes (slaves)
– store data in the local file system
– periodically report back to the namenode with lists of all existing blocks

• Clients communicate with both the namenode and the datanodes.
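
The division of labour above can be sketched in a few lines of plain Python. This is an illustrative model, not the HDFS protocol: the class `ToyNamenode`, its method names, and the block ids are all hypothetical. The namenode keeps only metadata, learns block locations from datanode reports, and answers client lookups.

```python
# Toy namenode/datanode interaction (illustrative only, not the HDFS protocol).
from collections import defaultdict


class ToyNamenode:
    def __init__(self):
        self.namespace = {}                       # file path -> ordered block ids
        self.block_locations = defaultdict(set)   # block id -> datanodes holding it

    def add_file(self, path, block_ids):
        """Record a file and its blocks in the namespace (metadata only)."""
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Process a datanode's periodic report of all blocks it stores."""
        for locations in self.block_locations.values():
            locations.discard(datanode)           # forget the previous report
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Tell a client which datanodes hold each block of a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]


namenode = ToyNamenode()
namenode.add_file("/barcamp/talk.txt", ["blk_1", "blk_2"])
namenode.block_report("dn1", ["blk_1", "blk_2"])
namenode.block_report("dn2", ["blk_1"])
print(namenode.locate("/barcamp/talk.txt"))
```

After the lookup, a client would read the block contents directly from the listed datanodes; the file data itself never passes through the namenode.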

Accessibility
• FileSystem Java API
– org.apache.hadoop.fs.*

• Web Interface

• Commands for HDFS users
$ hadoop dfs -mkdir /barcamp

$ hadoop dfs -ls /barcamp

• Commands for HDFS admins
$ hadoop dfsadmin -report

$ hadoop dfsadmin -refreshNodes

Programming Model
• Data is a stream of keys and values
• Map
– Input: <key1, value1> pairs from the data source
– Output: intermediate <key2, value2> pairs

• Reduce
– Called once per key, in sorted key order
– Input: <key2, list of value2>
– Output: <key3, value3> pairs
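
The model above fits in one small function. The following is a single-process sketch in plain Python, not the Hadoop API; `run_mapreduce` and the file-size job are hypothetical names and data chosen only to show the three key/value shapes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a map phase, a shuffle/sort, and a reduce phase in one process."""
    # Map: each <key1, value1> pair yields zero or more <key2, value2> pairs.
    intermediate = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            intermediate[key2].append(value2)   # shuffle: group values by key2
    # Reduce: called once per key2, in sorted key order.
    output = []
    for key2 in sorted(intermediate):
        output.extend(reducer(key2, intermediate[key2]))
    return output


# Hypothetical job: total size per file extension.
def ext_mapper(filename, size):
    yield filename.rsplit(".", 1)[-1], size


def sum_reducer(ext, sizes):
    yield ext, sum(sizes)


files = [("a.log", 10), ("b.txt", 5), ("c.log", 7)]
print(run_mapreduce(files, ext_mapper, sum_reducer))   # [('log', 17), ('txt', 5)]
```

In a real cluster the map calls and the reduce calls run on different machines, and the grouping step moves data over the network; the shapes of the inputs and outputs stay the same.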

WordCount Example
File01: Hello Barcamp Hello Everyone
File02: Hello Hadoop Hello Everyone

Map input:
<_, Hello Barcamp Hello Everyone>
<_, Hello Hadoop Hello Everyone>

Map output (counts already summed per file):
File01: <Hello, 2> <Barcamp, 1> <Everyone, 1>
File02: <Hello, 2> <Hadoop, 1> <Everyone, 1>

Reduce input (sorted by key):
<Barcamp, [1]>
<Everyone, [1,1]>
<Hadoop, [1]>
<Hello, [2,2]>

Reduce output:
<Barcamp, 1>
<Everyone, 2>
<Hadoop, 1>
<Hello, 4>
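
The same example can be traced in plain Python. This sketch does not use Hadoop; the function names are hypothetical. The map step sums counts within each file, which is why each file emits <Hello, 2> and not <Hello, 1> twice, and the map input key ("_") is ignored.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_words(text):
    """Map one file's text to <word, count> pairs, summed within the file."""
    return list(Counter(text.split()).items())


def shuffle(pairs):
    """Group intermediate values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_counts(word, counts):
    """Reduce one key's list of counts to a single total."""
    return word, sum(counts)


map_output = [pair for text in files.values() for pair in map_words(text)]
reduce_input = shuffle(map_output)
result = [reduce_counts(word, counts) for word, counts in reduce_input.items()]
print(result)   # [('Barcamp', 1), ('Everyone', 2), ('Hadoop', 1), ('Hello', 4)]
```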

MapReduce in Hadoop
• JobTracker (master)
– handles all jobs
– schedules tasks on the slaves
– monitors tasks and re-executes failed ones

• TaskTrackers (slaves)
– execute the tasks

• Task
– runs an individual map or reduce
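
The scheduling and re-execution roles can be illustrated with a toy loop in plain Python. This is a rough sketch under loud assumptions, not how Hadoop is implemented: `run_job`, the round-robin assignment, the `max_attempts` limit, and the failure model are all hypothetical, and the sketch ignores that reduces normally wait for maps to finish.

```python
from collections import deque


def run_job(tasks, trackers, run_task, max_attempts=3):
    """Toy JobTracker loop: schedule tasks round-robin, retry failed ones.

    run_task(tracker, task) is a caller-supplied function that returns True
    on success and False on failure. max_attempts is an assumed limit.
    """
    pending = deque((task, 1) for task in tasks)   # (task, attempt number)
    completed = {}                                 # task -> tracker that ran it
    turn = 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[turn % len(trackers)]   # schedule on the next slave
        turn += 1
        if run_task(tracker, task):                # monitor the outcome
            completed[task] = tracker
        elif attempt < max_attempts:
            pending.append((task, attempt + 1))    # re-execute later
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return completed


# Hypothetical failure model: tracker "tt2" fails every task it is given.
def flaky(tracker, task):
    return tracker != "tt2"


print(run_job(["map-0", "map-1", "reduce-0"], ["tt1", "tt2"], flaky))
```

Here `map-1` fails twice on `tt2` and succeeds on its third attempt, when it lands on `tt1`, so the job completes despite one bad slave.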

Introduction
• Nov 2006: Google released the paper “Bigtable: A Distributed Storage System for Structured Data”.
• Bigtable: a distributed, column-oriented store built on top of the Google File System.
• HBase: an open source implementation of Bigtable, built on top of HDFS.

Data Model
• Data are stored in tables of rows and columns.
• Cells are “versioned”
→ Data are addressed by row/column/version key.
• Table rows are sorted by row key, the table's primary key.
• Columns are grouped into column families.
→ A column name has the form “<family>:<label>”
• Tables are stored in regions.
• Region: a row range [start-key : end-key)
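
One way to picture this data model is as nested maps: row key, then column name, then version. The sketch below is plain Python and not the HBase client API; `ToyTable`, its methods, and the sample rows are hypothetical, and small integers stand in for version numbers.

```python
class ToyTable:
    """In-memory stand-in for an HBase-style table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> {"family:label" -> {version -> value}}

    def put(self, row, column, version, value):
        """Store a cell addressed by row, column, and version."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column, version=None):
        """Return the cell at `version`, or the newest version if omitted."""
        versions = self.rows[row][column]
        if version is None:
            version = max(versions)
        return versions[version]

    def scan(self, start_key, end_key):
        """Yield row keys in [start_key, end_key), sorted by row key."""
        for row in sorted(self.rows):
            if start_key <= row < end_key:
                yield row


table = ToyTable(["info"])
table.put("row2", "info:city", 1, "Saigon")
table.put("row1", "info:city", 1, "Hanoi")
table.put("row1", "info:city", 2, "Hue")
print(table.get("row1", "info:city"))      # Hue (newest version)
print(table.get("row1", "info:city", 1))   # Hanoi
print(list(table.scan("row1", "row2")))    # ['row1'] (end key is exclusive)
```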

Architecture
• Master Server
– assigns regions to regionservers
– monitors the health of regionservers
– handles administrative functions

• RegionServers
– contain regions and handle client read/write requests

• Catalog Tables (ROOT and META)
– maintain the current list, state, recent history, and location of all regions.
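
Because regions are contiguous row ranges, finding the regionserver for a row comes down to a search over sorted start keys. The following plain-Python sketch shows that idea only; the in-memory `REGIONS` list, the split points, and the server names are hypothetical and do not reflect the real catalog table layout.

```python
import bisect

# Hypothetical catalog: regions sorted by start key, each served by one
# regionserver. "" as the first start key means "from the start of the table".
REGIONS = [
    ("",  "rs1"),   # rows in ["", "g")
    ("g", "rs2"),   # rows in ["g", "p")
    ("p", "rs3"),   # rows in ["p", end of table)
]
START_KEYS = [start for start, _ in REGIONS]


def locate_region(row_key):
    """Return (start_key, regionserver) of the region containing row_key."""
    # The owning region is the last one whose start key is <= row_key.
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


print(locate_region("apple"))   # ('', 'rs1')
print(locate_region("g"))       # ('g', 'rs2'), start keys are inclusive
print(locate_region("zebra"))   # ('p', 'rs3')
```

A client would then send its read or write for that row directly to the returned regionserver.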

Accessibility
• Client API
org.apache.hadoop.hbase.client.*

• HBase Shell
$ bin/hbase shell
hbase>

• Web Interface

Introduction
• Hive was started at Facebook
• an open source data warehousing solution built on top of Hadoop
• for managing and querying structured data
• Hive QL: a SQL-like query language
– compiled into MapReduce jobs
• Typical uses: log processing, data mining, ...

Data Model
• Tables
– analogous to tables in an RDBMS
– rows are organized into typed columns
– all the data is stored in a directory in HDFS

• Partitions
– determine the distribution of data within sub-directories of the table directory

• Buckets
– based on the hash of a column in the table
– each bucket is stored as a file in the partition directory
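
The table, partition, and bucket layers map onto a directory path, which the plain-Python sketch below spells out. Several details are assumptions made for illustration: the warehouse root, the `bucket_NNNNN` file naming, the table and column names, and the use of CRC32 as a stand-in for Hive's own hash function.

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"   # assumed warehouse root


def bucket_file(table, partition, bucket_value, num_buckets):
    """Return the HDFS-style path of the file that would hold a row."""
    # Partitions: one sub-directory level per partition column value.
    partition_dir = "/".join(f"{col}={val}" for col, val in partition)
    # Buckets: hash of the bucketing column, modulo the number of buckets.
    # crc32 is a stand-in for Hive's own hash function.
    bucket = zlib.crc32(str(bucket_value).encode("utf-8")) % num_buckets
    return f"{WAREHOUSE}/{table}/{partition_dir}/bucket_{bucket:05d}"


# Hypothetical table partitioned by date and country, bucketed by user id.
path = bucket_file(
    "page_views", [("ds", "2009-12-13"), ("country", "VN")], "user42", 4
)
print(path)
```

Rows with the same bucketing value always land in the same file, and a query that filters on the partition columns only needs to read the matching sub-directory.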

Architecture
• Metastore
– contains metadata about data stored in Hive.
– stored in any SQL backend or in an embedded Derby database.
– Database: a namespace for tables
– Table metadata: column types, physical layout, ...
– Partition metadata

• Compiler

• Execution Engine

• Shell

Hive Query Language
• Data Definition (DDL) statements
– CREATE/DROP/ALTER TABLE
– SHOW TABLES/PARTITIONS

• Data Manipulation (DML) statements
– LOAD DATA
– INSERT
– SELECT

• User-defined functions: UDF/UDAF
