Introduction to Hadoop - The Essentials

November 25, 2013

Fadi Yousuf

About Me
• Founder and Managing Director of Axeldata Systems
• 13+ years involved in designing data architectures
• Previous life at Sun, Cisco, Oracle, Google, F5 Networks
• Working with Hadoop since 2011
• Certified as Cloudera Hadoop Developer, Administrator and HBase Specialist
• Authorized Cloudera Hadoop trainer
• Perspective - Hadoop is the foundation of scalable big data platforms
© 2013. Axeldata Systems FZ-LLC


Why Hadoop?
• RDBMS technology has served us well for 30+ years
• Excellent for low-latency, real-time, transaction-oriented data processing
• In the age of big data, RDBMS has many limitations:
– Volume: shared-all architecture limits linear scalability and requires fork-lift upgrades of hardware infrastructure when limits are reached
– Variety: data has to fit neatly into rows and columns with a rigid schema, which suits structured data but fails to handle unstructured data
– Velocity: ingesting data at speed means you can’t afford the time to shape data into the clean structures of relational databases

A Brief History of Hadoop
2002  Nutch created
2003  Google publishes GFS paper
2004  Google publishes MapReduce paper
2005  Nutch rearchitecture
2006  Hadoop subproject
2007  1000-node Yahoo! cluster
2008  Top-level Apache Project
2009  First commercial distribution
2010  Hive, Pig, HBase graduate
2011  Further commercial distributions
2012  Impala, the first real-time query engine
2013  Hadoop 2.0

The Birth of Hadoop
“The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell
and pronounce, meaningless, and not
used elsewhere: those are my naming
criteria. Kids are good at generating
such.”
- Doug Cutting, Creator of Hadoop



Hadoop: The Big Data Platform
It is a framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming models



Core Hadoop Concepts
• Applications are written in high-level code
– Developers don’t need to worry about network programming and dependencies

• Minimal communication between the nodes
– Shared nothing architecture

• Move compute to storage, not the opposite
– Computation happens locally on each machine
– No need to move data around

• Failure is accepted and tolerated
– Data is replicated multiple times across different machines


Hadoop Then…
• Storage
– Hadoop Distributed File System (HDFS)
• Programming Framework
– MapReduce
[Diagram: platform stack showing Batch MR, Resource Management, Storage, Integration]

Hadoop Now…
• Storage
– Hadoop Distributed File System (HDFS)
• Programming Framework
– MapReduce
[Diagram: platform stack showing SQL, Search, Math & Stats, In-Memory, …, Security, Metadata, Batch MR, Resource Management, Storage, Integration]
Source: Cloudera

What is HDFS?
• Distributed file system
• Breaks large files into
smaller blocks that are
stored on clusters of nodes
• Master-Slave architecture
• Processes:
– NameNode (Master)
– Standby NameNode (Master)
– DataNode (Slave)

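The block-splitting idea above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the function name `split_into_blocks` is hypothetical, and the 64MB default is taken from the block size shown on the HDFS Architecture slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block size shown on the HDFS Architecture slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs covering file_size bytes.

    Hypothetical helper for illustration; real HDFS clients do this internally.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # the last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks
```

Under these assumptions a 200MB file becomes four blocks: three full 64MB blocks and a final 8MB block.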

[Diagram: a NameNode, a Standby NameNode and four DataNodes]

HDFS Architecture

NameNode metadata (also shown for the Standby NameNode); block size 64MB

File    Blocks
File1   1 2 3
File2   4 5

Block   Location (nXrY = nodeX in RackY)
1       n1r1 n1r2 n2r2
2       n1r1 n1r2 n4r2
3       n2r1 n1r3 n3r3
4       n4r1 n2r3 n3r3
5       n3r1 n3r2 n4r2

DataNodes (blocks held by each node)

        Rack1   Rack2   Rack3
node1   1 2     1 2     3
node2   3       1       4
node3   5       5       3 4
node4   4       2 5     -
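
The metadata shown on this slide can be modelled as two lookup tables: file to blocks, and block to replica locations. The Python sketch below is only an illustration of that idea, with the values copied from the slide. The function `locate` and its `dead_nodes` parameter are hypothetical and are not part of any Hadoop API.

```python
# Metadata copied from the slide (nXrY is read here as nodeX in RackY).
FILE_BLOCKS = {"File1": [1, 2, 3], "File2": [4, 5]}
BLOCK_LOCATIONS = {
    1: ["n1r1", "n1r2", "n2r2"],
    2: ["n1r1", "n1r2", "n4r2"],
    3: ["n2r1", "n1r3", "n3r3"],
    4: ["n4r1", "n2r3", "n3r3"],
    5: ["n3r1", "n3r2", "n4r2"],
}


def locate(filename, dead_nodes=frozenset()):
    """Return {block: [nodes still holding a replica]} for one file.

    Hypothetical helper: it mimics the kind of answer a NameNode gives a client.
    """
    result = {}
    for block in FILE_BLOCKS[filename]:
        live = [node for node in BLOCK_LOCATIONS[block] if node not in dead_nodes]
        if not live:
            raise IOError(f"block {block} of {filename} has no live replica")
        result[block] = live
    return result
```

Because every block has three replicas, `locate("File2", dead_nodes={"n3r3", "n4r2"})` still finds two live copies of blocks 4 and 5, which is the sense in which failure is tolerated.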

What is MapReduce (MRv1)?
• Programming Framework
• Breaks processing into 2 phases:
– Map phase
– Reduce phase
• Master-Slave architecture
• Processes:
– JobTracker (Master)
– TaskTracker (Slave)
[Diagram: a JobTracker and five TaskTrackers]

MapReduce
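[Diagram: a Job is submitted to the JobTracker, which sends Tasks to the TaskTrackers on node1 to node4 across Rack1, Rack2 and Rack3; the nodes are drawn with the same block layout (blocks 1 to 5) as in the HDFS Architecture diagram]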
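
One way to picture "move compute to storage" is a scheduler that places each map task on a node that already holds the block it needs. The Python sketch below is a deliberate simplification and not the actual JobTracker algorithm: `assign_map_tasks` is a hypothetical function, and the least-loaded tie-breaking rule is an assumption made for illustration. The replica locations are copied from the HDFS Architecture slide.

```python
from collections import defaultdict

# Replica locations copied from the HDFS Architecture slide.
BLOCK_LOCATIONS = {
    1: ["n1r1", "n1r2", "n2r2"],
    2: ["n1r1", "n1r2", "n4r2"],
    3: ["n2r1", "n1r3", "n3r3"],
    4: ["n4r1", "n2r3", "n3r3"],
    5: ["n3r1", "n3r2", "n4r2"],
}


def assign_map_tasks(block_locations):
    """Assign one map task per block to the least-loaded node holding a replica."""
    load = defaultdict(int)  # number of tasks already given to each node
    assignment = {}
    for block in sorted(block_locations):
        # Only nodes that store the block are candidates, so no data is moved.
        node = min(block_locations[block], key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment
```

Every task in the returned assignment runs on a node that stores its input block, so the map phase reads its data locally.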

MapReduce: The Mapper
• Is a function that performs the map phase
• Each mapper usually operates on a single HDFS
block
• Takes a key and value as input and can generate multiple keys and values as output
• <k1,v1> → list(<k2,v2>)
• The output of all mappers is then sorted by key

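As a rough illustration of the mapper signature, here is a word-count mapper written as a plain Python generator. It is a sketch, not Hadoop API code: the name `word_count_mapper` and the regular expression used to pick out words are assumptions made for this example.

```python
import re


def word_count_mapper(key, value):
    """<k1,v1> -> list(<k2,v2>): key is the line's offset (unused), value is the line text."""
    for word in re.findall(r"[a-z'-]+", value.lower()):
        yield (word, 1)  # one (word, 1) pair per occurrence
```

Called on the line "I will arise and go now, and go to", it emits nine pairs, including ("go", 1) twice; summing them is left to the reduce phase.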

MapReduce: The Reducer
• Is a function that performs the reduce phase
• Each reducer operates on a portion of the output
of all mappers
• Takes a key with a list of all values as input and
generates an aggregate of the values for each
key
• <k2,list(v2)> → list(<k3,v3>)

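The reducer signature can be illustrated the same way with a word-count reducer in plain Python. This is a sketch rather than Hadoop API code, and the name `word_count_reducer` is hypothetical.

```python
def word_count_reducer(key, values):
    """<k2,list(v2)> -> list(<k3,v3>): emit one (word, total) pair for the key."""
    yield (key, sum(values))
```

For example, the key "go" with the values [1, 1, 1] produces the single pair ("go", 3).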

MapReduce Data Flow
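[Diagram: Input on HDFS is divided into Split 0, Split 1 and Split 2. Each split feeds a Map task whose output is sorted, copied to the reducers and merged. Two Reduce tasks write Part 0 and Part 1 to the Output on HDFS.]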
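The stages named in this data flow (map, sort, copy, merge, reduce) can be imitated in a single Python process. The sketch below is only a model of the flow, not how Hadoop implements it: `run_word_count` and `partition` are hypothetical names, word count is used as the example job, and a CRC32 hash stands in for the partitioner as an assumption.

```python
import heapq
import re
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick the reducer for a key; crc32 keeps the choice repeatable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_word_count(splits, num_reducers=2):
    """Run map, sort, copy, merge and reduce over a list of splits (lists of lines)."""
    # Map + sort: every split yields its own sorted run of (word, 1) pairs.
    map_outputs = []
    for lines in splits:
        pairs = [(word, 1) for line in lines
                 for word in re.findall(r"[a-z'-]+", line.lower())]
        map_outputs.append(sorted(pairs))

    # Copy: each reducer collects one sorted run per map task.
    runs = [[[] for _ in splits] for _ in range(num_reducers)]
    for map_id, output in enumerate(map_outputs):
        for key, value in output:
            runs[partition(key, num_reducers)][map_id].append((key, value))

    # Merge + reduce: merge the sorted runs, then sum the values of each key.
    parts = []
    for reducer_runs in runs:
        merged = heapq.merge(*reducer_runs)
        parts.append({key: sum(value for _, value in group)
                      for key, group in groupby(merged, key=itemgetter(0))})
    return parts  # parts[0] plays the role of Part 0, parts[1] of Part 1
```

Each word lands in exactly one part, because the partitioner sends all pairs with the same key to the same reducer.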

HDFS & MapReduce Example: Word Count
Original File
I will arise and go now, and go to Innisfree,
And a small cabin build there, of clay and wattles made:
Nine bean-rows will I have there, a hive for the honey-bee;
And live alone in the bee-loud glade.

And I shall have some peace there, for peace comes dropping slow,
Dropping from the veils of the morning to where the cricket sings;
There midnight's all a glimmer, and noon a purple glow,
And evening full of the linnet's wings.

I will arise and go now, for always night and day
I hear lake water lapping with low sounds by the shore;
While I stand on the roadway, or on the pavements grey,
I hear it in the deep heart's core.

File on HDFS
[Diagram: the file is stored on HDFS as three blocks, one per stanza. In the Mapper stage each block is read by its own Map task, and three Reduce tasks produce the Output.]

Demo: Word Count on Hadoop



Querying Data in Hadoop
Apache Hive
• Developed at Facebook
• Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL

Apache Pig
• Developed at Yahoo!
• High-level platform for creating MapReduce programs used with Hadoop
• Has a language called PigLatin
• Can be extended with UDFs written in Java, Python and other languages

Hadoop Ecosystem
• Avro: a data serialization system
• Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• HBase: a scalable, distributed database that supports structured data storage for large tables
• Mahout: a scalable machine learning and data mining library
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs
• Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• ZooKeeper: a high-performance coordination service for distributed applications

Yet Another Resource Negotiator (YARN)
– Also known as MapReduce v2 (MRv2)
– New framework that facilitates writing arbitrary
distributed processing frameworks and applications.
– Splits up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons.
– Can run applications that do not follow the
MapReduce model


Learn Hadoop
• Download the Cloudera QuickStart VM
– http://bit.ly/1b00iZj
– Makes it easy for you to get started with Hadoop
– Cloudera Distribution including Apache Hadoop (CDH)
– With Cloudera Manager, Cloudera Impala, and Cloudera Search, this virtual machine includes everything you need
• Formal training as Developer, Administrator, Analyst and others
• Free Courseware on Udacity: Introduction to Hadoop and MapReduce
– https://www.udacity.com/course/ud617

Other Hadoop Resources
Apache Project Websites
Hadoop: http://hadoop.apache.org/
Hive: http://hive.apache.org/
Pig: http://pig.apache.org/
Sqoop: http://sqoop.apache.org/
Flume: http://flume.apache.org/

Original GFS and MapReduce Papers
GFS: http://bit.ly/VZk9VL
MapReduce: http://bit.ly/8VDMHO

Community
A community of Hadoop professionals
and users in the region

meetup.com/Hadoop-User-Group-UAE/


Q&A


fadi@axeldata.com
www.axeldata.com
Hadoop and the Hadoop elephant logo
are trademarks of the Apache Software
Foundation. All other trademarks are
the property of their respective owners.
