Hadoop Eco System
Presented by: Sreenu Musham
27th March, 2015
Sreenu Musham
Data Warehouse Architect
sreenu.musham@yahoo.com
Hadoop Introduction
Agenda
Big Data and Its Challenges (Recap)
Hadoop and Its Evolution
Terminology used
HDFS
MapReduce
Hadoop Eco System
Hadoop Distributors
Feel of Hadoop (how it looks?)
Big Data and Challenges
 1024 KB = 1 MB
 1024 MB = 1 GB
 1024 GB = 1 TB
 1024 TB = 1 PB (Petabyte)
 And so on: Exabytes, Zettabytes, Yottabytes, Brontobytes, Geopbytes
Big Data and Challenges
Disk I/O, network, and processing bottlenecks
Storage: costly on enterprise machines
Vertical scaling is not always the right solution
Reliability
Handling unstructured data
Schema-less data
Disk Vs Transfer Rate
Across successive disk generations, the time to read an entire disk has grown: roughly 126 seconds, then 58 minutes, then 4 hours. Capacity has grown much faster than transfer rate.
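The trend above is just capacity divided by sequential transfer rate. A minimal sketch of that arithmetic, using an assumed modern disk (1 TB at 100 MB/s) rather than the slide's exact figures:

```python
def read_time_seconds(capacity_gb, rate_mb_per_s):
    """Time to scan a whole disk: capacity divided by sequential transfer rate."""
    return capacity_gb * 1024 / rate_mb_per_s

# Assumed example: a 1 TB disk read sequentially at 100 MB/s
seconds = read_time_seconds(1024, 100)
print(f"{seconds / 3600:.1f} hours")  # roughly 2.9 hours
```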
What is Hadoop?
 An Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data
 A framework where a job is divided among the nodes and processed in parallel
 Hides underlying system details and complexities from the user
 Developed in Java
 A set of machines running HDFS and MapReduce is known as a Hadoop cluster
 Core components: HDFS and MapReduce
Hadoop is not for all types of work
• Process transactions
• Low-Latency data Access
• Lot of Small Files
• Intensive calculations with little data
 Hadoop initiated new kinds of analysis
• Not just old thinking on bigger data
• Iterate over whole data sets, not only sample sets
• Use multiple data sources (beyond structured
data)
• Work with schema-less data
Warehouse Themes
Story Behind Hadoop
Google's victory in 2000
Searching in the 1990s
Story Behind Hadoop
2003: Google publishes a paper on GFS
2004: Google publishes a paper on MapReduce
2005: Created by Doug Cutting and Michael Cafarella (Yahoo)
2006: Named by Doug; Yahoo donates the project to Apache
2008: Terabyte Sort record
2009: Runs a 4,000-node Hadoop cluster
2010: Launches Hive
Node-Rack-Cluster (HA)
[Figure: nodes (Node 1 … Node N) are grouped into racks (Rack 1, Rack 2, … Rack N), and the racks together form the Hadoop cluster.]
Computation Method
[Figure: instead of moving data from the data source to a central server for processing, Hadoop moves the computation to the data in the cluster, reducing I/O and processing time.]
What is HDFS?
HDFS runs on top of an existing file system
Uses blocks to store a file or parts of a file
Stores data across multiple nodes
The size of a file can be larger than any single disk in the network
The default block size is 64 MB
The Main Objectives
Storing very large files
Streaming data access
Commodity hardware
Allow access to data on any node in the cluster
Able to handle hardware failures
What is HDFS?
If a chunk of a file is smaller than the HDFS block size
 Only the needed space is used
Example: a 300 MB file occupies four full 64 MB blocks plus one 44 MB block
HDFS blocks are replicated to several nodes for reliability
Maintains checksums of data for corruption detection and
recovery
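The 300 MB example can be checked with a few lines of Python. This is only a sketch of the arithmetic; `block_layout` is a hypothetical helper, not part of any Hadoop API:

```python
def block_layout(file_size_mb, block_size_mb=64):
    """Sizes of the HDFS blocks a file occupies; the last block only uses what it needs."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

print(block_layout(300))  # [64, 64, 64, 64, 44]
```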
What is MapReduce?
A programming model for processing the data in the Hadoop cluster
Consists of two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall dataset
• Typically one HDFS block of data
After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase
The Hadoop framework parallelizes the computation and handles failures, communication, and performance issues
Sample Data Flow
MAP: Take a large problem and divide it into sub-problems
- Break the data set down into small chunks
- Perform the same function on all sub-tasks
REDUCE: Combine the output from all sub-tasks
It works like a Unix pipeline
cat input | grep <pattern> | sort | uniq -c > output
Input | Map | Shuffle & Sort | Reduce | Output
Hadoop Architecture
[Figure: a master/slave, shared-nothing architecture. HDFS layer: one NameNode (master) and many DataNodes (slaves). MapReduce layer: one JobTracker (master) and many TaskTrackers (slaves). The design is scalable.]
Sample Data Flow - Word Count
Input: Air, Box, Car, Box, Do, Air
Map input <k1, v1>: (0, Air) (1, Box) (2, Car) (3, Box) (4, Do) (5, Air)
Map output list(<k2, v2>): (Air, 1) (Box, 1) (Car, 1) (Box, 1) (Do, 1) (Air, 1)
Shuffle & Sort <k2, list(v2)>: (Air, [1, 1]) (Box, [1, 1]) (Car, [1]) (Do, [1])
Reduce output list(<k3, v3>): (Air, 2) (Box, 2) (Car, 1) (Do, 1)
Output: Air,2 Box,2 Car,1 Do,1
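The word-count flow can be simulated in a few lines of plain Python. This is not Hadoop code, just the same map, shuffle & sort, and reduce steps applied to the six input records:

```python
from collections import defaultdict

records = ["Air", "Box", "Car", "Box", "Do", "Air"]

# Map: emit (word, 1) for every input record
mapped = [(word, 1) for word in records]

# Shuffle & sort: group the intermediate values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the grouped counts for each word
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'Air': 2, 'Box': 2, 'Car': 1, 'Do': 1}
```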
Sample Data Flow
Map output list(<k2, v2>): (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
Shuffle & Sort <k2, list(v2)>: (1949, [111, 78]) (1950, [0, 22, −11])
Reduce output list(<k3, v3>): (1949, 111) (1950, 22)
Types: <k1, v1> → list(<k2, v2>) → <k2, list(v2)> → list(<k3, v3>)
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(1, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(2, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(3, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(4, 0043012650999991949032418004...0500001N9+00781+99999999999...)
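The same shape with a different reduce function: here the reducer takes the maximum per year, reproducing the (1949, 111) and (1950, 22) results above. The (year, temperature) pairs are taken directly from the slide's map output; parsing the fixed-width weather records is omitted:

```python
from collections import defaultdict

# Map output from the slide: (year, temperature) pairs
mapped = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

# Shuffle & sort: group temperatures by year
groups = defaultdict(list)
for year, temp in mapped:
    groups[year].append(temp)

# Reduce: maximum temperature per year
max_temp = {year: max(temps) for year, temps in groups.items()}
print(max_temp)  # {1950: 22, 1949: 111}
```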
MapReduce: on splits
How Hadoop runs a MapReduce Job
Reading Data from HDFS
Writing Data to HDFS
Replication of Blocks
[Figure: an HDFS client writes file.txt; the NameNode records the file's blocks (File.txt: B1, B2, B3) and their locations; the blocks are replicated across nodes on Rack 1, Rack 2, and Rack 3 of the Hadoop cluster.]
Hadoop Eco System
Hadoop Distributions
Hadoop Versions
Accessing Hadoop/HDFS
hadoop fs -ls <path>
hadoop fs -mkdir testsreenu
hadoop fs -copyFromLocal samplefile.txt (equivalently: hadoop fs -put samplefile.txt)
Running MapReduce
hadoop jar newjob.jar samplefile in_dir out_dir
Hive
Pig
Pig code
a. Load customer records
cust = LOAD '/input/custs' USING PigStorage(',') AS (custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
b. Select only 100 records
amt = LIMIT cust 100;
DUMP amt;
c. Group customer records by profession
groupbyprofession = GROUP cust BY profession;
DESCRIBE groupbyprofession;
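What the GROUP … BY profession step produces can be pictured with a plain-Python equivalent. The three sample records below are hypothetical stand-ins for the contents of /input/custs:

```python
from collections import defaultdict

# Hypothetical records: (custid, firstname, lastname, age, profession)
customers = [
    ("4000001", "Kristina", "Chung", 55, "Pilot"),
    ("4000002", "Paige", "Chen", 74, "Teacher"),
    ("4000003", "Sherri", "Melton", 34, "Pilot"),
]

# Rough equivalent of: groupbyprofession = GROUP cust BY profession;
groupbyprofession = defaultdict(list)
for record in customers:
    groupbyprofession[record[4]].append(record)

print(sorted(groupbyprofession))        # ['Pilot', 'Teacher']
print(len(groupbyprofession["Pilot"]))  # 2
```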
Feel of Hadoop
Hadoop vs. Other Systems

Computing Model
- Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Data Model
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data in any format — unstructured, semi-structured, or structured; read-only mode

Cost Model
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault Tolerance
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common across thousands of machines; simple yet efficient fault tolerance

Key Characteristics
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
• Cloud Computing
• A computing model where any computing infrastructure can run on the cloud
• Hardware & software are provided as remote services
• Elastic: grows and shrinks based on the user's demand
• Example: Amazon EC2
Q & A
Hadoop Eco System
Presented By
Sreenu Musham