B.V.V Sangha’s
BASAVESHWAR ENGINEERING COLLEGE (Autonomous), BAGALKOT-587103
Department of Computer Science and Engineering
TUTORIAL ASSIGNMENT – 5
“Seminar on Hadoop Ecosystem MapReduce”
Submitted by: Shweta Policepatil, M.Tech 1st Sem CSE, Roll No: 09
Submitted to: Dr. S M Hatture, Course Coordinator
Contents
Overview
What is MapReduce?
MapReduce Data flow
Word Count Example
Features
Implementations
MapReduce vs. Parallel Databases
Conclusion
Overview
MapReduce is a programming model and an
associated implementation for processing and
generating large data sets. It was originally used to
collect and analyze website data for search optimization.
The Apache Software Foundation began
using MapReduce in the "Nutch" project, an
open source web search engine that is still
active today.
What is MapReduce?
It is a programming model that provides:
- Automatic parallelization and distribution.
Hadoop is capable of running MapReduce
programs written in various languages:
- Java, Ruby, Python, and C++
It is used to perform large-scale data analysis.
Map/Reduce Dataflow
Example: Word Count
Consider the problem of counting the number
of occurrences of each word in a large
collection of documents.
How would you do it in parallel?
Solution:
- Divide documents among workers
- Each worker parses its documents to find all
words and outputs (word, count) pairs.
- Partition (word, count) pairs across workers
based on word.
- For each word at a worker, locally add up
counts.
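The four steps above can be sketched in a single process. This is only an illustration, not how Hadoop itself is implemented: the function names (`parse`, `partition`, `add_up`, `word_count`), the use of a CRC32 hash to pick a worker, and the lowercasing of words are all assumptions made for the example. Punctuation is not stripped.

```python
import zlib
from collections import Counter, defaultdict


def parse(document):
    """Step 2: a worker parses one document and outputs (word, 1) pairs."""
    # Lowercasing is an assumption of this sketch; the slides do not specify it.
    return [(word.lower(), 1) for word in document.split()]


def partition(pairs, num_workers):
    """Step 3: route each pair to a worker chosen from the word itself,
    so every pair for a given word lands on the same worker."""
    buckets = defaultdict(list)
    for word, count in pairs:
        # crc32 gives the same value on every run, unlike Python's str hash().
        worker = zlib.crc32(word.encode("utf-8")) % num_workers
        buckets[worker].append((word, count))
    return buckets


def add_up(pairs):
    """Step 4: a worker locally adds up the counts for each of its words."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(documents, num_workers=3):
    # Steps 1 and 2: here the "workers" are simulated by a simple loop.
    all_pairs = []
    for document in documents:
        all_pairs.extend(parse(document))
    buckets = partition(all_pairs, num_workers)
    # Each word lives in exactly one bucket, so the partial results never clash.
    result = {}
    for worker_pairs in buckets.values():
        result.update(add_up(worker_pairs))
    return result


counts = word_count(["The crew of the space shuttle", "the crew"])
```

Because the partition step sends every occurrence of a word to the same worker, each worker can finish its counts without talking to the others.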
MapReduce Word Count Scenario

Input: a big document, read sequentially
"The crew of the space shuttle Endeavor recently returned to Earth as
ambassadors, harbingers of a new era of space exploration. Scientists at
NASA are saying that the recent assembly of the Dextre bot is the first step in
a long-term space-based man/machine partnership. 'The work we're doing now
-- the robotics we're doing -- is what we're going to need ..."

Map (provided by the programmer): reads the input and produces a set of (key, value) pairs
(The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...

Group by key: collects all pairs with the same key
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...

Reduce (provided by the programmer): collects all values belonging to a key and outputs (key, value)
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...

Only sequential reads of the data are required.
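The division of labor in this scenario, where the programmer provides Map and Reduce and the framework groups by key, can be sketched as a small in-memory driver. The names `map_reduce`, `wc_mapper`, and `wc_reducer` are hypothetical, and sorting in memory merely stands in for the grouping that a real framework performs across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Toy driver: the caller supplies mapper and reducer, as in the scenario."""
    # MAP: read each input record and collect the emitted (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))

    # GROUP BY KEY: sort so equal keys are adjacent, then gather their values.
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        # REDUCE: turn all values of one key into output (key, value) pairs.
        output.extend(reducer(key, values))
    return output


def wc_mapper(line):
    """Programmer-provided Map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)


def wc_reducer(word, counts):
    """Programmer-provided Reduce: add up the counts of one word."""
    yield (word, sum(counts))


result = map_reduce(["the crew of the space shuttle", "the crew"],
                    wc_mapper, wc_reducer)
```

Only `wc_mapper` and `wc_reducer` are specific to word count; the driver would stay the same for any other job expressed as a Map and a Reduce.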
Features
Java and C++ APIs
Each task can process data sets larger than RAM
Automatic re-execution on failure
- In a large cluster, some nodes are always slow
- Framework re-executes failed tasks
Locality optimizations
- Queries HDFS for locations of input data
- Tasks are scheduled close to the inputs when
possible.
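Two of these features, re-execution of failed tasks and locality-aware scheduling, can be illustrated with a short sketch. It is not Hadoop's scheduler: `pick_node`, `run_with_retries`, the node names, and the attempt limit are hypothetical and chosen only for the example.

```python
def pick_node(block_replicas, free_nodes):
    """Locality: prefer a free node that already stores the input block,
    otherwise fall back to any free node (or None if no node is free)."""
    for node in free_nodes:
        if node in block_replicas:
            return node
    return free_nodes[0] if free_nodes else None


def run_with_retries(task, max_attempts=4):
    """Re-execution: run the task again after a failure, up to max_attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as error:  # any failure triggers another attempt
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(attempt):
    """Simulated task that fails on its first two attempts."""
    if attempt < 3:
        raise IOError("simulated node failure")
    return f"done on attempt {attempt}"


chosen = pick_node({"node2", "node3"}, ["node1", "node2"])
status = run_with_retries(flaky_task)
```

Here `chosen` is the node that already holds a replica of the block, and `status` reports success only after the simulated failures have been retried.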
Implementations
Google
- Not available outside Google
Hadoop
- An open-source implementation in Java
- Uses HDFS for stable storage
Other systems work alongside it, such as Cassandra, a distributed
database originally developed at Facebook.
MapReduce vs. Parallel Databases
Data
- MapReduce: designed for unstructured data
- Parallel database: designed for structured relational data
Query Interface
- MapReduce: programs written in a variety of languages (some SQL support)
- Parallel database: SQL
Query Execution
- MapReduce: materializes results between Map and Reduce phases
- Parallel database: pipelines results between operators
Job Granularity
- MapReduce: determined by data storage block size (runtime scheduler)
- Parallel database: entire query
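The difference in query execution can be illustrated with a small sketch. It is an analogy only: a Python list stands in for intermediate results that MapReduce stores in full between phases, and generators stand in for operators that pass rows along as they are produced. All function names are hypothetical.

```python
def tokenize(lines):
    """Operator 1: split lines into words, one at a time."""
    for line in lines:
        for word in line.split():
            yield word


def to_pairs(words):
    """Operator 2: turn each word into a (word, 1) pair."""
    for word in words:
        yield (word, 1)


def materialized_total(lines):
    """MapReduce style: the full intermediate result exists before the
    next phase starts (a list here, stored files in a real system)."""
    intermediate = list(to_pairs(tokenize(lines)))
    total = sum(count for _, count in intermediate)
    return intermediate, total


def pipelined_total(lines):
    """Parallel-database style: each pair flows straight into the next
    operator, so no complete intermediate result is ever stored."""
    return sum(count for _, count in to_pairs(tokenize(lines)))


stored, total_a = materialized_total(["the crew", "the space shuttle"])
total_b = pipelined_total(["the crew", "the space shuttle"])
```

Both versions compute the same total; they differ only in whether the intermediate pairs are kept in full between the two steps.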
Conclusion
This seminar introduced MapReduce as a
programming model and discussed why one might use
MapReduce instead of a parallel database. The
distributed file system provides the platform for
MapReduce: MapReduce uses the NameNode
and DataNodes of HDFS to operate on big
data where it is already stored, which avoids
making additional copies of the data.
Thank You
