1
Hadoop Eco System
• Why Big Data?
• Ingredients of Big Data Eco System
• Working with Map Reduce
• Phases of MR
• HDFS
• Hive
• Use case
• Conclusion
Agenda
2
• Big Data is NOT JUST ABOUT SIZE; it's ABOUT
HOW IMPORTANT THE DATA IS within a large
chunk
• Data is CHANGING and getting MESSY
• Previously structured, now largely unstructured
• Non-uniform
• Many distributed contributors to the data
• Mobiles, PDAs, tablets, sensors
• Domains: Financial, Healthcare, Social Media
Why Big Data!!
3
Glimpse
4
• MapReduce – A model for solving big data problems by
applying map and reduce steps across clusters.
• HDFS – The distributed file system used by Hadoop.
• Hive – An SQL-based query engine for non-Java programmers.
• Pig – A data flow language and execution environment for
exploring very large datasets.
Ingredients of Eco System
5
• HBase – A distributed, column-oriented database.
• ZooKeeper – A distributed, highly available coordination
service.
• Sqoop – A tool for efficiently moving data between
relational databases and HDFS.
Ingredients cont.
6
• Protocols used – RPC/HTTP for communication
between commodity hardware nodes.
• Runs in pseudo-distributed mode or on full clusters
• Components- Daemons
• NameNode
• DataNode
• JobTracker
• TaskTracker
Hadoop Internals
7
• Map – a function applied to each unit of
the available data
• Reduce – a function used for
aggregation or reduction
Working with Map Reduce
8
• f = Σ {n=0 .. n=10} (n(n-1)/2)
• map = ∀ n from 0 to 10
• compute n(n-1)/2 independently
• Reduce = Σ ([values]) is the
aggregation/reduction function
Hence parallelism can be achieved
MR as a function
9
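The summation above can be sketched as a map step that computes n(n-1)/2 for each n in parallel, followed by a reduce step that sums the partial results. A minimal Python simulation of the idea (not actual Hadoop code):

```python
from functools import reduce

# Map step: compute n(n-1)/2 for each n independently (parallelizable).
mapped = [n * (n - 1) // 2 for n in range(0, 11)]

# Reduce step: aggregate the partial results by summation.
total = reduce(lambda acc, v: acc + v, mapped, 0)

print(total)  # 165
```

Because each mapped term depends only on its own n, the map step can run on many nodes at once; only the final reduction needs to see all the values.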
MR as representation
10
• Map: <K1, V1> → <K2, V2>
• V2 – list of values for key K2
• Reduce: <K2, V2> →(~) <K3, V3>
• ~ is the reduction operation
• Reduced output with specific keys and
values
• Data on HDFS
• Input partitioning – FileSplit, InputSplit
• Map
• Shuffle
• Sort
• Partition
• Reducer
• Aggregated Data on HDFS
Phases of MR
11
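The phases above can be walked through in plain Python with a word-count job. This is a conceptual sketch of map, shuffle/sort, partition, and reduce, not the Hadoop API, and the sample documents are made up:

```python
from collections import defaultdict

docs = ["big data big cluster", "data cluster data"]

# Map phase: emit a <word, 1> pair for every word in every input split.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle & sort: group values by key and visit keys in sorted order.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Partition: route each key to one of num_reducers reducers
# using a deterministic hash of the key.
num_reducers = 2
partitions = defaultdict(dict)
for key in sorted(groups):
    partitions[sum(map(ord, key)) % num_reducers][key] = groups[key]

# Reduce phase: each reducer sums the value list for its keys.
counts = {}
for reducer_input in partitions.values():
    for key, values in reducer_input.items():
        counts[key] = sum(values)

print(counts)  # {'big': 2, 'cluster': 2, 'data': 3}
```

In real Hadoop the shuffle moves data over the network between map and reduce tasks; here the grouping dictionary plays that role.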
Phases of MR depicted
12
Data flow in MR
13
MapReduce data flow with multiple reduce tasks
Shuffle and Sort phase
14
• Architecture
HDFS Hadoop Distributed File System
15
HDFS- Client Read
16
HDFS- Client Write
17
• List all files and directories in HDFS (recursively)
• $ hadoop fs -lsr
• Put a file into HDFS
• $ hadoop fs -put <from path> <to path>
• Get files from HDFS
• $ hadoop fs -get <from path>
• Run a jar file
• $ hadoop jar <jarfile> <className> <input
path> <output path>
HDFS - cli
18
• Job Configuration
• Key files: core-site.xml, mapred-site.xml
• Job-specific configuration can be
provided in the code
Map Reduce cont.
19
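The key files named above carry cluster-wide settings. Illustrative MR1-era entries are shown below; the host names and ports are placeholders, not values from the deck:

```xml
<!-- core-site.xml: default filesystem, i.e. the NameNode address -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: the JobTracker address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
</configuration>
```

Per-job overrides can then be set programmatically, as the slide notes.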
MR job in action
20
• Job Scheduling
• Fair scheduler
• Capacity scheduler
Job Scheduling
21
• Jobs are planned and placed in job pools
• Supports preemption
• If no pools are created and only one job
is available, the job runs as is
Fair Scheduler
22
• Supports multi-user scheduling
• Configured with a number of queues
across which jobs are scheduled
hierarchically
• One queue may be a child of another
queue
• Within each queue, jobs are scheduled
in FIFO order (with priorities)
Capacity scheduler
23
Map reduce Input Formats
24
• Map-Side Join
• Works on large inputs by performing the join
before the data reaches the map function
• Reduce-Side Join
• Input datasets don't have to be structured in
any particular way, but it is less efficient
because both datasets must go through the
MapReduce shuffle.
MR Joins
25
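The reduce-side join can be simulated: mappers tag each record with its source dataset, the shuffle groups records by join key, and the reducer crosses the two tagged lists. A conceptual Python sketch with made-up sample data:

```python
from collections import defaultdict

customers = [("c1", "Alice"), ("c2", "Bob")]                         # dataset A
orders = [("c1", "order-9"), ("c1", "order-7"), ("c2", "order-3")]   # dataset B

# Map: tag each record with its source so the reducer can tell them apart.
tagged = [(k, ("A", v)) for k, v in customers] + \
         [(k, ("B", v)) for k, v in orders]

# Shuffle: group all tagged records by the join key.
groups = defaultdict(list)
for key, record in tagged:
    groups[key].append(record)

# Reduce: cross the A-side and B-side records for each key.
joined = []
for key in sorted(groups):
    a_side = [v for tag, v in groups[key] if tag == "A"]
    b_side = [v for tag, v in groups[key] if tag == "B"]
    joined.extend((key, a, b) for a in a_side for b in b_side)

print(joined)
# [('c1', 'Alice', 'order-9'), ('c1', 'Alice', 'order-7'), ('c2', 'Bob', 'order-3')]
```

The cost noted on the slide is visible here: every record of both datasets passes through the grouping (shuffle) stage before any joining happens.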
• Hive was created to make it possible for
analysts with strong SQL skills (but meager
Java programming skills) to query large
datasets
• Built by developers at Facebook and later
contributed to the Apache open source
projects
• Hive runs on your workstation and converts
your SQL query into a series of MapReduce
jobs for execution on a Hadoop cluster
HIVE
26
• Unpack the tarball
• % tar xzf hive-x.y.z-dev.tar.gz
• Keep the paths handy
• % export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
• % export PATH=$PATH:$HIVE_INSTALL/bin
• Launch the Hive shell
• hive> SHOW TABLES;
Hive Infrastructure
27
Hive Modules
28
Hive Data Types
29
• Creating a table
• CREATE TABLE rank_customer(custid STRING,
score STRING, location STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• Load data
• LOAD DATA LOCAL INPATH
'input/dir/customerrank.dat' OVERWRITE INTO
TABLE rank_customer;
• Check the data in the warehouse
• $ ls /user/hive/warehouse/rank_customer/
Commands
30
• SELECT QUERY
• SELECT c.custid, c.score, c.location FROM
rank_customer c ORDER BY c.custid ASC,
c.location ASC, c.score DESC;
Commands cont.
31
• hive> CREATE DATABASE financials WITH
DBPROPERTIES ('creator' = 'MGP', 'date' =
'2014-10-03');
• hive> DROP DATABASE IF EXISTS financials;
• hive> ALTER DATABASE financials SET
DBPROPERTIES ('edited-by' = 'Joe Dba');
• hive> DROP TABLE IF EXISTS employees;
• hive> ALTER TABLE log_messages RENAME TO
logmsgs;
Hive-
DDL Commands
32
• Determine the rank of each customer
based on customer id and the locality the
customer belongs to. The highest scorer
gains the highest rank.
• Input Output
Use case
33
• Custom Writable
Using Map Reduce
34
• CustomWritable methods overridden
CustomWritable cont.
35
Driver code
36
Mapper Code
37
Partitioner Code
38
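The partitioner code itself appears only as a slide image; conceptually, a custom partitioner maps each key to a reducer index in [0, numReducers). A deterministic Python stand-in (using a Java `String.hashCode`-style hash, purely illustrative, not the deck's code):

```python
def java_string_hash(s):
    """Deterministic hash in the style of Java's String.hashCode."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def get_partition(key, num_reducers):
    # Bucket the key's hash by the reducer count, so equal keys
    # always land on the same reducer.
    return java_string_hash(key) % num_reducers

# Same key, same reducer -- which is what the shuffle relies on.
assert get_partition("cust1", 3) == get_partition("cust1", 3)
```

The important property is stability: every record with the same key must reach the same reducer, otherwise the per-key value lists would be split across reducers.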
Sort Comparator class
39
Reducer Code
40
• -- Obtain the ranking on the basis of
location and customer id, as per the
requirement
• hive> SELECT custid, score, location, rank()
OVER (PARTITION BY custid, location ORDER BY
score DESC)
AS myrank
FROM rank_customer;
Hive Query
41
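The effect of `rank() OVER (PARTITION BY ... ORDER BY score DESC)` can be checked against a plain Python simulation. The rows below are made-up sample data, not the deck's input:

```python
from collections import defaultdict

# (custid, score, location) sample rows -- hypothetical data.
rows = [("c1", 90, "NY"), ("c1", 75, "NY"), ("c2", 75, "LA"), ("c2", 88, "LA")]

# Partition the rows by (custid, location).
parts = defaultdict(list)
for custid, score, location in rows:
    parts[(custid, location)].append((custid, score, location))

# Within each partition, order by score descending and assign ranks.
ranked = []
for part_key in sorted(parts):
    ordered = sorted(parts[part_key], key=lambda r: r[1], reverse=True)
    rank, prev_score = 0, None
    for i, (custid, score, location) in enumerate(ordered, start=1):
        if score != prev_score:  # rank() gives ties equal rank, with gaps
            rank = i
        prev_score = score
        ranked.append((custid, score, location, rank))

print(ranked)
# [('c1', 90, 'NY', 1), ('c1', 75, 'NY', 2), ('c2', 88, 'LA', 1), ('c2', 75, 'LA', 2)]
```

Hive distributes exactly this computation as MapReduce jobs: the PARTITION BY clause maps to the shuffle's grouping, and the ORDER BY to the sort within each reducer's key group.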
Hive results
42
• The Hadoop ecosystem is designed
mainly for large numbers of large files
• Not well suited to large numbers of
small files
• Achieves parallelism over huge volumes
of data
• Mapping and reducing are the key, core
functions for achieving that parallelism
Conclusion
43
• The Hadoop ecosystem works efficiently
on commodity hardware
• Distributed hardware can be utilized
efficiently
• Hadoop MapReduce jobs are typically
written in Java
• Hive gives SQL programmers a familiar
interface, though internally Java MR
jobs run
Conclusion cont.
44
• Hadoop: The Definitive Guide, Third
Edition by Tom White
• Programming Hive by Edward Capriolo,
Dean Wampler, and Jason Rutherglen
• http://hadoop.apache.org/
• http://hive.apache.org/
References
45
46
THANK YOU
Q&A
PRADEEP M G