Hadoop Eco System
Presented by: Sreenu Musham
27th March, 2015
Sreenu Musham
Data Warehouse Architect
sreenu.musham@yahoo.com
Hadoop Introduction
Agenda
Big Data and Its Challenges (Recap)
Hadoop and Its Evolution
Terminology used
HDFS
MapReduce
Hadoop Eco System
Hadoop Distributors
Feel of Hadoop (how it looks?)
Big Data and Challenges
 1024 KB = 1 MB
 1024 MB = 1 GB
 1024 GB = 1 TB
 1024 TB = 1 PB (Petabyte)
 And so on: Exabytes, Zettabytes, Yottabytes, Brontobytes, Geopbytes
Big Data and Challenges
Disk I/O, network, and processing bottlenecks
Storage: costly on enterprise machines
Vertical scaling is not always the right solution
Reliability
Handling unstructured data
Schema-less data
Disk Vs Transfer Rate
Across successive disk generations, the time to read an entire disk has grown: roughly 126 seconds, then 58 minutes, then 4 hours. Capacity has grown much faster than transfer rate.
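The trend above is just capacity divided by sequential transfer rate. A minimal sketch of that arithmetic, using an assumed modern disk (1 TB at 100 MB/s) rather than the slide's exact figures:

```python
def read_time_seconds(capacity_gb, rate_mb_per_s):
    """Time to scan a whole disk: capacity divided by sequential transfer rate."""
    return capacity_gb * 1024 / rate_mb_per_s

# Assumed example: a 1 TB disk read sequentially at 100 MB/s
seconds = read_time_seconds(1024, 100)
print(f"{seconds / 3600:.1f} hours")  # roughly 2.9 hours
```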
What is Hadoop?
 An Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data
 A framework where a job is divided among the nodes and processed in parallel
 Hides underlying system details and complexities from the user
 Developed in Java
 A set of machines running HDFS and MapReduce is known as a Hadoop cluster
 Core components: HDFS and MapReduce
Hadoop is not for all types of work
• Process transactions
• Low-Latency data Access
• Lot of Small Files
• Intensive calculations with little data
 Hadoop initiated new kinds of analysis
• Not just old thinking on bigger data
• Iterate over whole data sets, not only sample sets
• Use multiple data sources (beyond structured
data)
• Work with schema-less data
Warehouse Themes
Story Behind Hadoop
Google's victory in 2000
Searching in the 1990s
Story Behind Hadoop
2003: Google publishes a paper on GFS
2004: Google publishes a paper on MapReduce
2005: Created by Doug Cutting and Michael Cafarella (Yahoo)
2006: Named by Doug; Yahoo donates the project to Apache
2008: Terabyte Sort record
2009: Runs a 4,000-node Hadoop cluster
2010: Launches Hive
Node-Rack-Cluster (HA)
[Figure: nodes (Node 1 … Node N) are grouped into racks (Rack 1, Rack 2, … Rack N), and the racks together form the Hadoop cluster.]
Computation Method
[Figure: instead of moving data from the data source to a central server for processing, Hadoop moves the computation to the data in the cluster, reducing I/O and processing time.]
What is HDFS?
HDFS runs on top of an existing file system
Uses blocks to store a file or parts of a file
Stores data across multiple nodes
The size of a file can be larger than any single disk in the network
The default block size is 64 MB
The Main Objectives
Storing very large files
Streaming data access
Commodity hardware
Allow access to data on any node in the cluster
Able to handle hardware failures
What is HDFS?
If a chunk of a file is smaller than the HDFS block size
 Only the needed space is used
Example: a 300 MB file occupies four full 64 MB blocks plus one 44 MB block
HDFS blocks are replicated to several nodes for reliability
Maintains checksums of data for corruption detection and
recovery
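The 300 MB example can be checked with a few lines of Python. This is only a sketch of the arithmetic; `block_layout` is a hypothetical helper, not part of any Hadoop API:

```python
def block_layout(file_size_mb, block_size_mb=64):
    """Sizes of the HDFS blocks a file occupies; the last block only uses what it needs."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

print(block_layout(300))  # [64, 64, 64, 64, 44]
```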
What is MapReduce?
A programming model for processing the data in the Hadoop cluster
Consists of two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall dataset
• Typically one HDFS block of data
After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase
The Hadoop framework parallelizes the computation and handles failures, communication, and performance issues
Sample Data Flow
MAP: Take a large problem and divide it into sub-problems
- Break the data set down into small chunks
- Perform the same function on all sub-tasks
REDUCE: Combine the output from all sub-tasks
It works like a Unix pipeline
cat input | grep <pattern> | sort | uniq -c > output
Input | Map | Shuffle & Sort | Reduce | Output
Hadoop Architecture
[Figure: a master/slave, shared-nothing architecture. HDFS layer: one NameNode (master) and many DataNodes (slaves). MapReduce layer: one JobTracker (master) and many TaskTrackers (slaves). The design is scalable.]
Sample Data Flow - Word Count
Input: Air, Box, Car, Box, Do, Air
Map input <k1, v1>: (0, Air) (1, Box) (2, Car) (3, Box) (4, Do) (5, Air)
Map output list(<k2, v2>): (Air, 1) (Box, 1) (Car, 1) (Box, 1) (Do, 1) (Air, 1)
Shuffle & Sort <k2, list(v2)>: (Air, [1, 1]) (Box, [1, 1]) (Car, [1]) (Do, [1])
Reduce output list(<k3, v3>): (Air, 2) (Box, 2) (Car, 1) (Do, 1)
Output: Air,2 Box,2 Car,1 Do,1
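The word-count flow can be simulated in a few lines of plain Python. This is not Hadoop code, just the same map, shuffle & sort, and reduce steps applied to the six input records:

```python
from collections import defaultdict

records = ["Air", "Box", "Car", "Box", "Do", "Air"]

# Map: emit (word, 1) for every input record
mapped = [(word, 1) for word in records]

# Shuffle & sort: group the intermediate values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the grouped counts for each word
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'Air': 2, 'Box': 2, 'Car': 1, 'Do': 1}
```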
Sample Data Flow
Map output list(<k2, v2>): (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
Shuffle & Sort <k2, list(v2)>: (1949, [111, 78]) (1950, [0, 22, −11])
Reduce output list(<k3, v3>): (1949, 111) (1950, 22)
Types: <k1, v1> → list(<k2, v2>) → <k2, list(v2)> → list(<k3, v3>)
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(1, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(2, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(3, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(4, 0043012650999991949032418004...0500001N9+00781+99999999999...)
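The same shape with a different reduce function: here the reducer takes the maximum per year, reproducing the (1949, 111) and (1950, 22) results above. The (year, temperature) pairs are taken directly from the slide's map output; parsing the fixed-width weather records is omitted:

```python
from collections import defaultdict

# Map output from the slide: (year, temperature) pairs
mapped = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

# Shuffle & sort: group temperatures by year
groups = defaultdict(list)
for year, temp in mapped:
    groups[year].append(temp)

# Reduce: maximum temperature per year
max_temp = {year: max(temps) for year, temps in groups.items()}
print(max_temp)  # {1950: 22, 1949: 111}
```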
MapReduce: on splits
How Hadoop runs a MapReduce Job
Reading Data from HDFS
Writing Data to HDFS
Replication of Blocks
[Figure: an HDFS client writes file.txt; the NameNode records the file's blocks (File.txt: B1, B2, B3) and their locations; the blocks are replicated across nodes on Rack 1, Rack 2, and Rack 3 of the Hadoop cluster.]
Hadoop Eco System
Hadoop Distributions
Hadoop Versions
Accessing Hadoop/HDFS
hadoop fs -ls <path>
hadoop fs -mkdir testsreenu
hadoop fs -copyFromLocal samplefile.txt (equivalently: hadoop fs -put samplefile.txt)
Running MapReduce
hadoop jar newjob.jar samplefile in_dir out_dir
Hive
Pig
Pig code
a. Load customer records
cust = LOAD '/input/custs' USING PigStorage(',') AS (custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
b. Select only 100 records
amt = LIMIT cust 100;
DUMP amt;
c. Group customer records by profession
groupbyprofession = GROUP cust BY profession;
DESCRIBE groupbyprofession;
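What the GROUP … BY profession step produces can be pictured with a plain-Python equivalent. The three sample records below are hypothetical stand-ins for the contents of /input/custs:

```python
from collections import defaultdict

# Hypothetical records: (custid, firstname, lastname, age, profession)
customers = [
    ("4000001", "Kristina", "Chung", 55, "Pilot"),
    ("4000002", "Paige", "Chen", 74, "Teacher"),
    ("4000003", "Sherri", "Melton", 34, "Pilot"),
]

# Rough equivalent of: groupbyprofession = GROUP cust BY profession;
groupbyprofession = defaultdict(list)
for record in customers:
    groupbyprofession[record[4]].append(record)

print(sorted(groupbyprofession))        # ['Pilot', 'Teacher']
print(len(groupbyprofession["Pilot"]))  # 2
```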
Feel of Hadoop
Hadoop vs. Other Systems

Computing Model
- Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Data Model
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data in any format — unstructured, semi-structured, or structured; read-only mode

Cost Model
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault Tolerance
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common across thousands of machines; simple yet efficient fault tolerance

Key Characteristics
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
• Cloud Computing
• A computing model where any computing infrastructure can run on the cloud
• Hardware & software are provided as remote services
• Elastic: grows and shrinks based on the user's demand
• Example: Amazon EC2
Q & A
Hadoop Eco System
Presented By
Sreenu Musham