Introduction to Hadoop
The Essentials
November 25, 2013

Fadi Yousuf
About Me
• Founder and Managing Director of Axeldata Systems
• 13+ years involved in designing data architectures
• Previous life at Sun, Cisco, Oracle, Google, F5 Networks
• Working with Hadoop since 2011
• Certified as Cloudera Hadoop Developer, Administrator and HBase Specialist
• Authorized Cloudera Hadoop trainer
• Perspective: Hadoop is the foundation of scalable big data platforms
© 2013. Axeldata Systems FZ-LLC

Why Hadoop?
• RDBMS technology has served us well for 30+ years
• Excellent for low-latency, real-time, transaction-oriented data processing
• In the age of big data, RDBMS has many limitations:
– Volume: shared-all architecture limits linear scalability and requires fork-lift upgrades of hardware infrastructure when limits are reached
– Variety: data has to fit neatly into rows and columns with a rigid schema, which suits structured data but fails to handle unstructured data
– Velocity: ingesting data at speed means you can’t afford the time to shape data into the clean structures of relational databases
A Brief History of Hadoop
[Timeline diagram spanning 2002 to 2013. Milestones shown:]
• Nutch created
• Google publishes GFS paper
• Google publishes MapReduce paper
• Nutch rearchitecture
• Hadoop subproject
• 1000-node Yahoo! cluster
• Top-level Apache Project
• First commercial distribution
• Hive, Pig, HBase graduate
• Further commercial distributions
• Impala, the first real-time query engine
• Hadoop 2.0
The Birth of Hadoop
“The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell
and pronounce, meaningless, and not
used elsewhere: those are my naming
criteria. Kids are good at generating
such.”
- Doug Cutting, Creator of Hadoop

Hadoop: The Big Data Platform
It is a framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming models

Core Hadoop Concepts
• Applications are written in high-level code
– Developers don’t need to worry about network programming and dependencies

• Minimal communication between the nodes
– Shared-nothing architecture

• Move compute to storage, not the opposite
– Computation happens locally on each machine
– No need to move data around

• Failure is accepted and tolerated
– Data is replicated multiple times across different machines
Hadoop Then…
• Storage
– Hadoop Distributed File System (HDFS)

• Programming Framework
– MapReduce

[Diagram: platform stack with the layers Batch MR, Resource Management, Storage, and Integration]
Hadoop Now…
• Storage
– Hadoop Distributed File System (HDFS)

• Programming Framework
– MapReduce

[Diagram: platform stack with the processing engines Batch MR, SQL, Search, Math & Stats, In-Memory, …, alongside the layers Resource Management, Storage, Integration, Security, and Metadata. Source: Cloudera]
What is HDFS?
• Distributed file system
• Breaks large files into smaller blocks that are stored on clusters of nodes
• Master-Slave architecture
• Processes:
– NameNode (Master)
– Standby NameNode (Master)
– DataNode (Slave)

[Diagram: one NameNode, one Standby NameNode, and four DataNodes]
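To make "breaks large files into smaller blocks" concrete, here is a minimal sketch that computes block boundaries for a file. It assumes the 64MB block size shown on the HDFS Architecture slide. The function name `split_into_blocks` and the 200MB file size are hypothetical and chosen only for illustration; this is not an HDFS API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block size shown in this deck


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# Hypothetical 200MB file: three full blocks plus one shorter final block.
blocks = split_into_blocks(200 * 1024 * 1024)
for index, (offset, length) in enumerate(blocks, start=1):
    print(f"block {index}: offset={offset} length={length}")
```

Only the final block is allowed to be smaller than the block size; every other block is full.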
HDFS Architecture

NameNode metadata (a Standby NameNode is shown alongside the NameNode):

File    Blocks
File1   1 2 3
File2   4 5

Block   Location
1       n1r1 n1r2 n2r2
2       n1r1 n1r2 n4r2
3       n2r1 n1r3 n3r3
4       n4r1 n2r3 n3r3
5       n3r1 n3r2 n4r2

Block size shown: 64MB

DataNodes (blocks stored on each node, per rack):

        Rack1   Rack2   Rack3
node1   1 2     1 2     3
node2   3       1       4
node3   5       5       3 4
node4   4       2 5     (none)
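The metadata on this slide can be read as two lookup tables, and replication is what lets a file survive a node failure. The sketch below copies the block locations from the slide and checks whether a file is still readable after some nodes fail. The helper names `locate` and `is_readable` are hypothetical, and treating a label such as `n1r1` as "node 1 in rack 1" is an assumption about the slide's notation.

```python
# Metadata copied from the slide: file -> blocks, block -> replica locations.
FILE_BLOCKS = {"File1": [1, 2, 3], "File2": [4, 5]}
BLOCK_LOCATIONS = {
    1: ["n1r1", "n1r2", "n2r2"],
    2: ["n1r1", "n1r2", "n4r2"],
    3: ["n2r1", "n1r3", "n3r3"],
    4: ["n4r1", "n2r3", "n3r3"],
    5: ["n3r1", "n3r2", "n4r2"],
}


def locate(file_name, failed_nodes=frozenset()):
    """Return {block: live replica locations} for a file, skipping failed nodes."""
    return {
        block: [node for node in BLOCK_LOCATIONS[block] if node not in failed_nodes]
        for block in FILE_BLOCKS[file_name]
    }


def is_readable(file_name, failed_nodes=frozenset()):
    """A file stays readable while every block has at least one live replica."""
    return all(locate(file_name, failed_nodes).values())


print(locate("File1"))
print(is_readable("File1", {"n1r1"}))                  # one node lost
print(is_readable("File1", {"n1r1", "n1r2", "n2r2"}))  # all replicas of block 1 lost
```

With three replicas per block, losing a single node never makes a file unreadable in this model. A block becomes unavailable only when every node holding it has failed.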
What is MapReduce (MRv1)?
• Programming Framework
• Breaks processing into 2 phases:
– Map phase
– Reduce phase

• Master-Slave architecture
• Processes:
– JobTracker (Master)
– TaskTracker (Slave)

[Diagram: one JobTracker and five TaskTrackers]
MapReduce
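[Diagram: a Job is submitted to the JobTracker, which hands out Tasks to the TaskTrackers running on the nodes that store the data blocks]

TaskTrackers (blocks stored on each node, per rack):

        Rack1   Rack2   Rack3
node1   1 2     1 2     3
node2   3       1       4
node3   5       5       3 4
node4   4       2 5     (none)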
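The diagram illustrates "move compute to storage": each task is sent to a node that already holds the block it needs. The sketch below is a simplified illustration of that idea, not the JobTracker's actual scheduling algorithm. `assign_tasks` is a hypothetical helper, and breaking ties by least-loaded node and then by name is an assumption made to keep the example deterministic.

```python
from collections import Counter

# Block -> replica locations, as listed on the HDFS Architecture slide.
BLOCK_LOCATIONS = {
    1: ["n1r1", "n1r2", "n2r2"],
    2: ["n1r1", "n1r2", "n4r2"],
    3: ["n2r1", "n1r3", "n3r3"],
    4: ["n4r1", "n2r3", "n3r3"],
    5: ["n3r1", "n3r2", "n4r2"],
}


def assign_tasks(blocks, block_locations):
    """Assign one map task per block to the least-loaded node holding a replica."""
    load = Counter()
    assignment = {}
    for block in blocks:
        # Only nodes that already store the block are candidates (data-local).
        node = min(block_locations[block], key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment


assignment = assign_tasks([1, 2, 3], BLOCK_LOCATIONS)
for block, node in assignment.items():
    print(f"task for block {block} -> {node}")
```

Every task lands on a node that stores its input block, so no block has to be copied across the network before processing starts.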
MapReduce: The Mapper
• Is a function that performs the map phase
• Each mapper usually operates on a single HDFS block
• Takes a key and value as input and can generate multiple keys and values as output
• <k1,v1> → list(<k2,v2>)
• The output of all mappers is then sorted by key
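A minimal sketch of the mapper signature <k1,v1> → list(<k2,v2>), written in plain Python rather than against the Hadoop API. It assumes a word-count job where k1 is a line offset and v1 is the line text. `word_count_mapper` is a hypothetical name, and stripping punctuation and lowercasing are illustrative choices.

```python
import string


def word_count_mapper(key, value):
    """Map <k1, v1> = (line offset, line text) to list(<k2, v2>) = [(word, 1), ...]."""
    pairs = []
    for token in value.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were punctuation only
            pairs.append((word, 1))
    return pairs


pairs = word_count_mapper(0, "I will arise and go now, and go to Innisfree,")
print(pairs)
```

One input pair produces many output pairs, and the same key (here "and" or "go") may be emitted more than once. Combining those duplicates is left to the reduce phase.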
MapReduce: The Reducer
• Is a function that performs the reduce phase
• Each reducer operates on a portion of the output of all mappers
• Takes a key with a list of all its values as input and generates an aggregate of the values for each key
• <k2,list(v2)> → list(<k3,v3>)
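A matching sketch of the reducer signature <k2,list(v2)> → list(<k3,v3>), again in plain Python rather than the Hadoop API. It assumes the aggregate is a sum, as in word count. `sum_reducer` is a hypothetical name.

```python
def sum_reducer(key, values):
    """Reduce <k2, list(v2)> to list(<k3, v3>): a single (key, total) pair."""
    return [(key, sum(values))]


# The framework would call the reducer once per key with all values for that key.
print(sum_reducer("and", [1, 1, 1]))
print(sum_reducer("innisfree", [1]))
```

The reducer returns a list because the general contract allows zero or more output pairs per key, even though this aggregate emits exactly one.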
MapReduce Data Flow
[Diagram: Input HDFS → Split 0, Split 1, Split 2 → Map (one per split) → sort → copy → merge → Reduce (two reducers) → Part 0, Part 1 → Output HDFS]
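The sort, copy, and merge steps between Map and Reduce can be sketched as follows. This is a simplified in-memory model, not Hadoop's implementation. `partition` and `shuffle` are hypothetical helpers, and CRC32 stands in for the framework's hash partitioner so that the result is stable between runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer that owns a key (CRC32 is an illustrative stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Sort each map output, copy pairs to their reducer, and merge values by key."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in sorted(output):              # sort on the map side
            target = partition(key, num_reducers)      # copy to the owning reducer
            reducer_inputs[target][key].append(value)  # merge values under one key
    return [dict(sorted(merged.items())) for merged in reducer_inputs]


# Three map outputs (one per split) feeding two reducers, as in the diagram.
map_outputs = [
    [("go", 1), ("and", 1)],
    [("and", 1)],
    [("go", 1), ("now", 1)],
]
for index, part in enumerate(shuffle(map_outputs, 2)):
    print(f"reducer {index} input: {part}")
```

Because the partition depends only on the key, all values for a given key reach the same reducer, which is what makes a per-key aggregate possible.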
HDFS & MapReduce Example: Word Count
Original File
I will arise and go now, and go to
Innisfree,
And a small cabin build there, of
clay and wattles made:
Nine bean-rows will I have there, a
hive for the honey-bee;
And live alone in the bee-loud
glade.
And I shall have some peace there,
for peace comes dropping slow,
Dropping from the veils of the
morning to where the cricket sings;
There midnight's all a glimmer, and
noon a purple glow,
And evening full of the linnet's
wings.
I will arise and go now, for always
night and day
I hear lake water lapping with low
sounds by the shore;
While I stand on the roadway, or on
the pavements grey,
I hear it in the deep heart's core.

File on HDFS
[Diagram: the file is stored on HDFS as three blocks, one stanza per block. Each block is processed by its own Map task (Mapper), the map outputs feed three Reduce tasks, and the reducers write the Output.]
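The whole example can be simulated in a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce steps, not a Hadoop job. It assumes one split per stanza, as in the diagram, and the function names are hypothetical.

```python
import string
from collections import defaultdict

# One split per stanza, as the file is laid out on HDFS in the diagram.
SPLITS = [
    "I will arise and go now, and go to Innisfree, "
    "And a small cabin build there, of clay and wattles made: "
    "Nine bean-rows will I have there, a hive for the honey-bee; "
    "And live alone in the bee-loud glade.",
    "And I shall have some peace there, for peace comes dropping slow, "
    "Dropping from the veils of the morning to where the cricket sings; "
    "There midnight's all a glimmer, and noon a purple glow, "
    "And evening full of the linnet's wings.",
    "I will arise and go now, for always night and day "
    "I hear lake water lapping with low sounds by the shore; "
    "While I stand on the roadway, or on the pavements grey, "
    "I hear it in the deep heart's core.",
]


def mapper(text):
    """Emit (word, 1) for every word in one split."""
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:
            yield word, 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(splits):
    grouped = defaultdict(list)
    for split in splits:                 # one map task per split
        for word, one in mapper(split):
            grouped[word].append(one)    # shuffle: group values by key
    return dict(reducer(word, ones) for word, ones in sorted(grouped.items()))


counts = word_count(SPLITS)
print(counts["peace"], counts["hear"], counts["innisfree"])
```

On a real cluster the three map calls would run on different nodes and the grouping step would move data over the network. The per-word totals are the same either way.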
Demo: Word Count on Hadoop

Querying Data in Hadoop
Apache Hive
• Developed at Facebook
• Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL

Apache Pig
• Developed at Yahoo!
• High-level platform for creating MapReduce programs used with Hadoop
• Has a language called Pig Latin
• Can be extended with UDFs written in Java, Python and other languages
Hadoop Ecosystem
• Avro: a data serialization system
• Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• HBase: a scalable, distributed database that supports structured data storage for large tables
• Mahout: a scalable machine learning and data mining library
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs
• Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• ZooKeeper: a high-performance coordination service for distributed applications
Yet Another Resource Negotiator (YARN)
– Also known as MapReduce v2 (MRv2)
– New framework that facilitates writing arbitrary distributed processing frameworks and applications
– Splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons
– Can run applications that do not follow the MapReduce model
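The split of responsibilities can be pictured with a toy model. In YARN the cluster-wide daemon is the ResourceManager and the per-application one is the ApplicationMaster. The classes below borrow those names but are a hypothetical sketch, not the YARN API, and a "container" here is reduced to a simple counter.

```python
class ResourceManager:
    """Cluster-wide resource management only: hands out and reclaims containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-application scheduling and monitoring; tasks need not be map or reduce."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run(self, resource_manager):
        while self.pending:
            granted = resource_manager.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(task() for task in batch)  # run one task per container
            resource_manager.release(granted)
        return self.done


rm = ResourceManager(total_containers=2)
app = ApplicationMaster("squares", [lambda i=i: i * i for i in range(5)])
print(app.run(rm))
```

The resource manager never looks at what a task does, which is why the same cluster can host applications that do not follow the MapReduce model.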
Learn Hadoop
• Download the Cloudera QuickStart VM
– http://bit.ly/1b00iZj
– To make it easy for you to get started with Hadoop
– Cloudera Distribution including Apache Hadoop (CDH)
– With Cloudera Manager, Cloudera Impala, and Cloudera Search, this virtual machine includes everything you need

• Formal training as Developer, Administrator, Analyst and others
• Free courseware on Udacity: Introduction to Hadoop and MapReduce
– https://www.udacity.com/course/ud617
Other Hadoop Resources
Apache Project Websites
Hadoop: http://hadoop.apache.org/
Hive: http://hive.apache.org/
Pig: http://pig.apache.org/
Sqoop: http://sqoop.apache.org/
Flume: http://flume.apache.org/

Original GFS and MapReduce Papers
GFS: http://bit.ly/VZk9VL
MapReduce: http://bit.ly/8VDMHO
Community
A community of Hadoop professionals
and users in the region

meetup.com/Hadoop-User-Group-UAE/
Q&A
fadi@axeldata.com
www.axeldata.com
Hadoop and the Hadoop elephant logo
are trademarks of the Apache Software
Foundation. All other trademarks are
the property of their respective owners.



Editor's Notes

  • #5: In a nutshell, Hadoop grew out of research at Google, which got adopted by the Open Source community, and supported by heavyweights such as Yahoo!, Facebook and others. It had 6 years to mature.
  • #6: No, it’s not Charles Darwin. Hadoop was named after the creator’s son’s toy elephant.
  • #7: So What is Hadoop?
  • #12: Brief description of the operation of HDFS. There are 3 main components (daemons) in HDFS: NameNode, DataNode, and Secondary NameNode.
  • #14: There are 2 main components in MapReduce (daemons): JobTracker and TaskTracker
  • #17: MapReduce is composed of Map tasks and Reduce tasks. Tasks within a phase run in parallel and do not depend on each other’s output; the Reduce tasks consume the output of the Map tasks.
  • #24: The major resources to start learning more about Hadoop. Also recommended is reading the research papers from Google that spurred the whole Hadoop ecosystem (by Sanjay Ghemawat and Jeff Dean).
  • #26: Q&A with the famous Hadoop elephant mascot