1
Hadoop Eco System
• Why Big Data?
• Ingredients of Big Data Eco System
• Working with Map Reduce
• Phases of MR
• HDFS
• Hive
• Use case
• Conclusion
Agenda
2
• Big Data is NOT JUST ABOUT SIZE; it's ABOUT
HOW IMPORTANT THE DATA IS within a large
chunk
• Data is CHANGING and getting MESSY
• Previously structured, now largely unstructured
• Non-uniform
• Many distributed contributors to the data
• Mobiles, PDAs, tablets, sensors
• Domains: Financial, Healthcare, Social Media
Why Big Data!!
3
Glimpse
4
• MapReduce – A model for solving big data problems by
applying map and reduce steps across clusters.
• HDFS – The distributed file system used by Hadoop.
• Hive – An SQL-based query engine for non-Java programmers.
• Pig – A data flow language and execution environment for
exploring very large datasets.
Ingredients of Eco System
5
• HBase – A distributed, column-oriented database.
• ZooKeeper – A distributed, highly available coordination
service.
• Sqoop – A tool for efficiently moving data between
relational databases and HDFS.
Ingredients cont.
6
• Protocols used – RPC/HTTP for communication
between commodity hardware nodes.
• Runs in pseudo-distributed mode or on full clusters
• Components- Daemons
• NameNode
• DataNode
• JobTracker
• TaskTracker
Hadoop Internals
7
• Map – a function applied to each unit of
the available data
• Reduce – a function used for
aggregation or reduction
Working with Map Reduce
8
• f = Σ {n=0 .. n=10} (n(n-1)/2)
• map = ∀ n from 0 to 10
• compute n(n-1)/2 independently
• Reduce = Σ ([values]) is the
aggregation/reduction function
Hence parallelism can be achieved
MR as a function
9
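The summation above can be sketched as a map step that computes n(n-1)/2 for each n in parallel, followed by a reduce step that sums the partial results. A minimal Python simulation of the idea (not actual Hadoop code):

```python
from functools import reduce

# Map step: compute n(n-1)/2 for each n independently (parallelizable).
mapped = [n * (n - 1) // 2 for n in range(0, 11)]

# Reduce step: aggregate the partial results by summation.
total = reduce(lambda acc, v: acc + v, mapped, 0)

print(total)  # 165
```

Because each mapped term depends only on its own n, the map step can run on many nodes at once; only the final reduction needs to see all the values.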
MR as representation
10
• Map: <K1, V1> → <K2, V2>
• V2 – list of values for key K2
• Reduce: <K2, V2> →(~) <K3, V3>
• ~ is the reduction operation
• Reduced output with specific keys and
values
• Data on HDFS
• Input partitioning – FileSplit, InputSplit
• Map
• Shuffle
• Sort
• Partition
• Reducer
• Aggregated Data on HDFS
Phases of MR
11
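The phases above can be walked through in plain Python with a word-count job. This is a conceptual sketch of map, shuffle/sort, partition, and reduce, not the Hadoop API, and the sample documents are made up:

```python
from collections import defaultdict

docs = ["big data big cluster", "data cluster data"]

# Map phase: emit a <word, 1> pair for every word in every input split.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle & sort: group values by key and visit keys in sorted order.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Partition: route each key to one of num_reducers reducers
# using a deterministic hash of the key.
num_reducers = 2
partitions = defaultdict(dict)
for key in sorted(groups):
    partitions[sum(map(ord, key)) % num_reducers][key] = groups[key]

# Reduce phase: each reducer sums the value list for its keys.
counts = {}
for reducer_input in partitions.values():
    for key, values in reducer_input.items():
        counts[key] = sum(values)

print(counts)  # {'big': 2, 'cluster': 2, 'data': 3}
```

In real Hadoop the shuffle moves data over the network between map and reduce tasks; here the grouping dictionary plays that role.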
Phases of MR depicted
12
Data flow in MR
13
MapReduce data flow with multiple reduce tasks
Shuffle and Sort phase
14
• Architecture
HDFS Hadoop Distributed File System
15
HDFS- Client Read
16
HDFS- Client Write
17
• List all files and directories in HDFS (recursively)
• $ hadoop fs -lsr
• Put a file into HDFS
• $ hadoop fs -put <from path> <to path>
• Get files from HDFS
• $ hadoop fs -get <from path>
• Run a jar file
• $ hadoop jar <jarfile> <className> <input
path> <output path>
HDFS - cli
18
• Job Configuration
• Key files: core-site.xml, mapred-site.xml
• Job-specific configuration can be
provided in the code
Map Reduce cont.
19
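The key files named above carry cluster-wide settings. Illustrative MR1-era entries are shown below; the host names and ports are placeholders, not values from the deck:

```xml
<!-- core-site.xml: default filesystem, i.e. the NameNode address -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: the JobTracker address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
</configuration>
```

Per-job overrides can then be set programmatically, as the slide notes.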
MR job in action
20
• Job Scheduling
• Fair scheduler
• Capacity scheduler
Job Scheduling
21
• Jobs are planned and placed in job pools
• Supports preemption
• If no pools are created and only one job
is available, the job runs as is
Fair Scheduler
22
• Supports multi-user scheduling
• Configured with a number of queues
across which jobs are scheduled
hierarchically
• One queue may be a child of another
queue
• Within each queue, jobs are scheduled
in FIFO order (with priorities)
Capacity scheduler
23
Map reduce Input Formats
24
• Map-Side Join
• Works on large inputs by performing the join
before the data reaches the map function
• Reduce-Side Join
• Input datasets don't have to be structured in
any particular way, but it is less efficient
because both datasets must go through the
MapReduce shuffle.
MR Joins
25
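The reduce-side join can be simulated: mappers tag each record with its source dataset, the shuffle groups records by join key, and the reducer crosses the two tagged lists. A conceptual Python sketch with made-up sample data:

```python
from collections import defaultdict

customers = [("c1", "Alice"), ("c2", "Bob")]                         # dataset A
orders = [("c1", "order-9"), ("c1", "order-7"), ("c2", "order-3")]   # dataset B

# Map: tag each record with its source so the reducer can tell them apart.
tagged = [(k, ("A", v)) for k, v in customers] + \
         [(k, ("B", v)) for k, v in orders]

# Shuffle: group all tagged records by the join key.
groups = defaultdict(list)
for key, record in tagged:
    groups[key].append(record)

# Reduce: cross the A-side and B-side records for each key.
joined = []
for key in sorted(groups):
    a_side = [v for tag, v in groups[key] if tag == "A"]
    b_side = [v for tag, v in groups[key] if tag == "B"]
    joined.extend((key, a, b) for a in a_side for b in b_side)

print(joined)
# [('c1', 'Alice', 'order-9'), ('c1', 'Alice', 'order-7'), ('c2', 'Bob', 'order-3')]
```

The cost noted on the slide is visible here: every record of both datasets passes through the grouping (shuffle) stage before any joining happens.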
• Hive was created to make it possible for
analysts with strong SQL skills (but meager
Java programming skills) to query large
datasets
• Built by developers at Facebook and later
contributed to the Apache open source
projects
• Hive runs on your workstation and converts
your SQL query into a series of MapReduce
jobs for execution on a Hadoop cluster
HIVE
26
• Unpack the tarball
• % tar xzf hive-x.y.z-dev.tar.gz
• Keep the paths handy
• % export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
• % export PATH=$PATH:$HIVE_INSTALL/bin
• Launch the Hive shell
• hive> SHOW TABLES;
Hive Infrastructure
27
Hive Modules
28
Hive Data Types
29
• Creating a table
• CREATE TABLE rank_customer(custid STRING,
score STRING, location STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• Load data
• LOAD DATA LOCAL INPATH
'input/dir/customerrank.dat' OVERWRITE INTO
TABLE rank_customer;
• Check the data in the warehouse
• $ ls /user/hive/warehouse/rank_customer/
Commands
30
• SELECT QUERY
• SELECT c.custid, c.score, c.location FROM
rank_customer c ORDER BY c.custid ASC,
c.location ASC, c.score DESC;
Commands cont.
31
• hive> CREATE DATABASE financials WITH
DBPROPERTIES ('creator' = 'MGP', 'date' =
'2014-10-03');
• hive> DROP DATABASE IF EXISTS financials;
• hive> ALTER DATABASE financials SET
DBPROPERTIES ('edited-by' = 'Joe Dba');
• hive> DROP TABLE IF EXISTS employees;
• hive> ALTER TABLE log_messages RENAME TO
logmsgs;
Hive-
DDL Commands
32
• Determine the rank of each customer
based on customer id and the locality the
customer belongs to. The highest scorer
gains the highest rank.
• Input Output
Use case
33
• Custom Writable
Using Map Reduce
34
• CustomWritable methods overridden
CustomWritable cont.
35
Driver code
36
Mapper Code
37
Partitioner Code
38
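The partitioner code itself appears only as a slide image; conceptually, a custom partitioner maps each key to a reducer index in [0, numReducers). A deterministic Python stand-in (using a Java `String.hashCode`-style hash, purely illustrative, not the deck's code):

```python
def java_string_hash(s):
    """Deterministic hash in the style of Java's String.hashCode."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def get_partition(key, num_reducers):
    # Bucket the key's hash by the reducer count, so equal keys
    # always land on the same reducer.
    return java_string_hash(key) % num_reducers

# Same key, same reducer -- which is what the shuffle relies on.
assert get_partition("cust1", 3) == get_partition("cust1", 3)
```

The important property is stability: every record with the same key must reach the same reducer, otherwise the per-key value lists would be split across reducers.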
Sort Comparator class
39
Reducer Code
40
• -- Obtain the ranking on the basis of
location and customer id, as per the
requirement
• hive> SELECT custid, score, location, rank()
OVER (PARTITION BY custid, location ORDER BY
score DESC)
AS myrank
FROM rank_customer;
Hive Query
41
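The effect of `rank() OVER (PARTITION BY ... ORDER BY score DESC)` can be checked against a plain Python simulation. The rows below are made-up sample data, not the deck's input:

```python
from collections import defaultdict

# (custid, score, location) sample rows -- hypothetical data.
rows = [("c1", 90, "NY"), ("c1", 75, "NY"), ("c2", 75, "LA"), ("c2", 88, "LA")]

# Partition the rows by (custid, location).
parts = defaultdict(list)
for custid, score, location in rows:
    parts[(custid, location)].append((custid, score, location))

# Within each partition, order by score descending and assign ranks.
ranked = []
for part_key in sorted(parts):
    ordered = sorted(parts[part_key], key=lambda r: r[1], reverse=True)
    rank, prev_score = 0, None
    for i, (custid, score, location) in enumerate(ordered, start=1):
        if score != prev_score:  # rank() gives ties equal rank, with gaps
            rank = i
        prev_score = score
        ranked.append((custid, score, location, rank))

print(ranked)
# [('c1', 90, 'NY', 1), ('c1', 75, 'NY', 2), ('c2', 88, 'LA', 1), ('c2', 75, 'LA', 2)]
```

Hive distributes exactly this computation as MapReduce jobs: the PARTITION BY clause maps to the shuffle's grouping, and the ORDER BY to the sort within each reducer's key group.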
Hive results
42
• The Hadoop ecosystem is designed
mainly for large numbers of large files
• Not well suited to large numbers of
small files
• Achieves parallelism over huge volumes
of data
• Mapping and reducing are the key, core
functions for achieving that parallelism
Conclusion
43
• The Hadoop ecosystem works efficiently
on commodity hardware
• Distributed hardware can be utilized
efficiently
• Hadoop MapReduce jobs are typically
written in Java
• Hive gives SQL programmers a familiar
interface, though internally Java MR
jobs run
Conclusion cont.
44
• Hadoop: The Definitive Guide, Third
Edition by Tom White
• Programming Hive by Edward Capriolo,
Dean Wampler, and Jason Rutherglen
• http://hadoop.apache.org/
• http://hive.apache.org/
References
45
46
THANK YOU
Q&A
PRADEEP M G