Hadoop/MapReduce Computing Paradigm
CS525: Special Topics in DBs: Large-Scale Data Management
Presented by Kelly Technologies
www.kellytechno.com
Large-Scale Data Analytics
The MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems:
 Many enterprises are turning to Hadoop
 Especially for applications generating big data
 Web applications, social networks, scientific applications
Why Is Hadoop Able to Compete?
Hadoop offers:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Commodity, inexpensive hardware
- An efficient and simple fault-tolerance mechanism
Databases offer:
- Performance (extensive indexing, tuning, and data-organization techniques)
- Features such as provenance tracking and annotation management
What Is Hadoop?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Large datasets  terabytes or petabytes of data
- Large clusters  hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
What Is Hadoop? (Cont'd)
The Hadoop framework consists of two main layers:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave, shared-nothing architecture:
- Master node (single node)
- Many slave nodes
Design Principles of Hadoop
Need to process big data
Need to parallelize computation across thousands of nodes
Commodity hardware:
- A large number of low-end, cheap machines working in parallel to solve a computing problem
- This is in contrast to parallel DBs, which use a small number of high-end, expensive machines
Design Principles of Hadoop (Cont'd)
Automatic parallelization & distribution
- Hidden from the end user
Fault tolerance and automatic recovery
- Nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction
- Users only provide two functions, "map" and "reduce"
Who Uses MapReduce/Hadoop?
- Google: inventors of the MapReduce computing paradigm
- Yahoo: developing Hadoop, the open-source implementation of MapReduce
- IBM, Microsoft, Oracle
- Facebook, Amazon, AOL, Netflix
- Many others, plus universities and research labs
Hadoop: How it Works
Hadoop Architecture
- Master node (single node)
- Many slave nodes
Two layers span these nodes:
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB by default)
- Each block is replicated N times (default N = 3)
[Diagram: file F is split into five 64 MB blocks, numbered 1-5.]
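The block-splitting and replication arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code; the function and constant names are made up for this example.

```python
import math

BLOCK_SIZE_MB = 64   # HDFS default block size used in this deck
REPLICATION = 3      # HDFS default replication factor

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (num_blocks, total_stored_mb) for a file of the given size."""
    # A file is divided into fixed-size blocks; the last one may be partial.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is replicated, so raw cluster storage is replication x size.
    total_stored_mb = file_size_mb * replication
    return num_blocks, total_stored_mb

# A 300 MB file F is split into 5 blocks (as in the diagram) and, with
# 3-way replication, occupies 900 MB of raw cluster storage.
print(hdfs_footprint(300))  # (5, 900)
```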
Main Properties of HDFS
Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
Replication: each data block is replicated many times (default is 3)
Failure: failure is the norm rather than the exception
Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
- The namenode constantly checks on the datanodes
Map-Reduce Execution Engine (Example: Color Count)
[Diagram: input blocks on HDFS feed four Map tasks; each Map produces (k, v) pairs such as (color, 1); a parse-hash step and the Shuffle & Sorting phase group pairs by k; three Reduce tasks consume (k, [v]) pairs such as (color, [1,1,1,1,1,1,...]) and produce (k', v') pairs such as (color, 100).]
Users only provide the "Map" and "Reduce" functions
Properties of MapReduce Engine
The Job Tracker is the master node (runs with the namenode)
- Receives the user's job
- Decides how many tasks will run (number of mappers)
- Decides where to run each mapper (concept of locality)
• This file has 5 blocks  run 5 map tasks
• Where to run the task that reads block "1"?
• Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
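The locality heuristic above can be sketched as a tiny scheduling function. This is an illustrative simplification, not the Job Tracker's actual algorithm; all names here are hypothetical.

```python
def schedule_map_tasks(block_locations, nodes):
    """block_locations: {block_id: [nodes holding a replica]}.
    Returns {block_id: chosen node}, preferring data-local nodes."""
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer a live node that already stores a replica of the block.
        local = [n for n in replicas if n in nodes]
        # Fall back to any available node when no replica is local.
        assignment[block] = local[0] if local else nodes[0]
    return assignment

# Block "1" has replicas on Node 1 and Node 3, so its map task runs locally.
print(schedule_map_tasks({1: ["Node 1", "Node 3"]},
                         ["Node 1", "Node 2", "Node 3"]))  # {1: 'Node 1'}
```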
Properties of MapReduce Engine (Cont'd)
The Task Tracker is the slave node (runs on each datanode)
- Receives tasks from the Job Tracker
- Runs each task to completion (either a map or a reduce task)
- Stays in communication with the Job Tracker, reporting progress
[Diagram: four Map tasks feed three Reduce tasks through parse-hash steps.]
In this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks.
Key-Value Pairs
Mappers and reducers are users' code (provided functions)
- They just need to obey the key-value pairs interface
Mappers:
- Consume <key, value> pairs
- Produce <key, value> pairs
Reducers:
- Consume <key, <list of values>>
- Produce <key, value>
Shuffling and sorting:
- A hidden phase between mappers and reducers
- Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
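The hidden shuffle-and-sort phase can be simulated in memory to make the <key, <list of values>> contract concrete. This is a sketch of the concept only; real Hadoop performs this as a distributed, disk-backed sort.

```python
from itertools import groupby

def shuffle_and_sort(mapper_outputs):
    """Group (key, value) pairs from all mappers into (key, [values]),
    sorted by key: the hidden phase between mappers and reducers."""
    # Flatten all mappers' outputs, then sort so equal keys are adjacent.
    pairs = sorted(p for out in mapper_outputs for p in out)
    # Group adjacent pairs by key into the reducer-facing form.
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=lambda p: p[0])]

# Two mappers each emit (color, 1) pairs; the shuffle groups them per key.
print(shuffle_and_sort([[("blue", 1), ("red", 1)], [("blue", 1)]]))
# [('blue', [1, 1]), ('red', [1])]
```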
MapReduce Phases
Deciding what the key and the value will be is the developer's responsibility.
Example 1: Word Count
Job: Count the occurrences of each word in a data set
[Diagram: map tasks tokenize the input lines; reduce tasks sum the counts per word.]
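Word Count can be written as the two user-provided functions plus a driver that simulates the shuffle. This is a single-machine sketch under the deck's model; `map_fn`, `reduce_fn`, and `run_job` are illustrative names, and the shuffle is an in-memory dictionary rather than a distributed sort.

```python
from collections import defaultdict

def map_fn(_, line):
    """Mapper: consumes <key, value>, emits (word, 1) for each word."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reducer: consumes <word, [1, 1, ...]>, emits (word, total)."""
    yield (word, sum(counts))

def run_job(lines):
    groups = defaultdict(list)            # simulated shuffle & sort
    for i, line in enumerate(lines):
        for k, v in map_fn(i, line):
            groups[k].append(v)
    return dict(kv for k in sorted(groups) for kv in reduce_fn(k, groups[k]))

print(run_job(["the cat", "the dog"]))    # {'cat': 1, 'dog': 1, 'the': 2}
```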
Example 2: Color Count
Job: Count the number of each color in a data set
[Diagram: the same pipeline as the Map-Reduce Execution Engine diagram above: map tasks emit (color, 1), shuffle & sorting groups by color, and reduce tasks sum to (color, count), writing Part0001, Part0002, and Part0003.]
That's the output file; it has 3 parts, probably on 3 different machines.
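The reason the output lands in three parts is that each reducer writes its own file, and a hash of the key decides which reducer gets which colors. The sketch below illustrates that idea; the character-sum "hash" and every name here are made up for the example (Hadoop uses its own partitioner).

```python
def color_count(colors, num_reducers=3):
    """Sketch of Color Count: each reducer produces one output part
    (Part0001..), so the result is split across num_reducers files."""
    parts = [dict() for _ in range(num_reducers)]
    for c in colors:
        # Map emits (color, 1); a stable toy hash routes it to a reducer.
        part = parts[sum(map(ord, c)) % num_reducers]
        # Reduce sums the 1s per color within that reducer's partition.
        part[c] = part.get(c, 0) + 1
    return parts

parts = color_count(["blue", "green", "blue", "red"])
# All occurrences of one color always land in the same part.
assert sum(p.get("blue", 0) for p in parts) == 2
```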
Example 3: Color Filter
Job: Select only the blue and the green colors
[Diagram: four map tasks read input blocks from HDFS and write their outputs directly back to HDFS.]
• Each map task selects only the blue or green colors
• No need for a reduce phase
That's the output file; it has 4 parts (Part0001-Part0004), probably on 4 different machines.
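A map-only job like this needs no shuffle at all: each mapper filters its block and writes one output part. A minimal sketch, with hypothetical names and plain lists standing in for HDFS blocks and part files:

```python
def color_filter(blocks, keep=("blue", "green")):
    """Map-only job: each map task filters its own input block and writes
    its own output part directly; no shuffle, no reducers."""
    return [[c for c in block if c in keep]   # one output part per map task
            for block in blocks]

# Four input blocks produce four output parts, one per map task.
blocks = [["red", "blue"], ["green", "yellow"], ["blue"], ["cyan"]]
print(color_filter(blocks))  # [['blue'], ['green'], ['blue'], []]
```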
Bigger Picture: Hadoop vs. Other Systems

Computing model:
- Distributed databases: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; a job is the unit of work; no concurrency control
Data model:
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data will fit in any format, (un/semi)structured; read-only mode
Cost model:
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines
Fault tolerance:
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
Key characteristics:
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance

Cloud Computing:
• A computing model where any computing infrastructure can run on the cloud
• Hardware & software are provided as remote services
• Elastic: grows and shrinks based on the user's demand
• Example: Amazon EC2