Velammal College of Engineering and Technology
(Autonomous)
Department of Information Technology
21IT401
BIG DATA ENGINEERING
Syllabus
UNIT II - MAP REDUCE
HDFS Overview, Hadoop and Spark,
Map Reduce Programming Basics,
Analyzing the data with Hadoop: Java MapReduce -
Developing Map Reduce Application.
HDFS Overview
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
HDFS Overview
Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. HDFS also has the concept of a block, but it is a much larger unit (128 MB by default in recent releases).
Benefits:
• A file can be larger than any single disk in the network.
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
HDFS Overview
Name nodes and Data nodes:
• An HDFS cluster has two types of nodes: a name node (the master) and a number of data nodes (workers).
• The name node manages the file system namespace. It
maintains the file system tree and the metadata for all
the files and directories in the tree.
• This information is stored persistently on the local disk
in the form of two files: the namespace image and the
edit log. The name node also knows the data nodes on
which all the blocks for a given file are located.
HDFS Overview
The Command-Line Interface:
• There are many other interfaces to HDFS, but the
command line is one of the simplest, and to many
developers the most familiar.
• There are two properties that we set in the pseudo-distributed configuration.
• The first is fs.default.name, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop.
• The second, dfs.replication, is set to 1 so that HDFS doesn't replicate filesystem blocks by the usual default factor of three.
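For concreteness, a sketch of how these two properties would appear in the pseudo-distributed configuration files (standard Hadoop layout; single-node values assumed):

core-site.xml — default filesystem for Hadoop:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml — replication factor of 1 for a single-node cluster:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>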
HDFS Overview
Basic Filesystem Operations:
• HDFS (Hadoop Distributed File System)
operations involve various tasks related to
storing, accessing, and managing data within
the distributed file system. These operations
can be broadly classified into write, read, and
administrative operations.
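A few representative operations as issued through the command-line interface (the paths are placeholders for illustration):

hadoop fs -mkdir /user/hadoop/books                   # create a directory
hadoop fs -copyFromLocal book.txt /user/hadoop/books  # write a local file into HDFS
hadoop fs -ls /user/hadoop/books                      # list directory contents
hadoop fs -cat /user/hadoop/books/book.txt            # read a file back
hadoop fs -rm /user/hadoop/books/book.txt             # delete a file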
HDFS Overview
Write Operation
• File Write Request:
– A client initiates a request to write a file to HDFS.
• Block Allocation:
– The client contacts the NameNode to get a list of DataNodes for each block of the file.
– The NameNode returns a list of DataNodes for each block, ensuring the replication factor is
maintained.
• Block Writing:
– The client starts writing the file in chunks (blocks) to the first DataNode in the list.
– The first DataNode forwards the block to the next DataNode in the list (pipeline fashion), until
the replication factor is met.
• Acknowledgement:
– Each DataNode in the pipeline sends an acknowledgement back to the client once it has
received the block.
– Once all DataNodes have acknowledged the block, the client proceeds to the next block.
• Commit:
– After writing all blocks, the client notifies the NameNode that the file writing is complete.
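From the client's point of view, this whole pipeline is hidden behind the FileSystem API. A minimal write sketch (the path is an assumed placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() contacts the NameNode for block allocation; the returned
    // stream pushes each block through the DataNode pipeline.
    try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo.txt"))) {
      out.writeUTF("hello hdfs");
    } // close() completes the file with the NameNode
  }
}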
HDFS Overview
Read Operation
• File Read Request:
– A client initiates a request to read a file from HDFS.
• Block Location Retrieval:
– The client contacts the NameNode to get the locations of
the blocks of the file.
– The NameNode provides the list of DataNodes that have
the blocks.
• Block Reading:
– The client reads the blocks directly from the DataNodes.
– The client can choose the closest DataNode to optimize
read performance.
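The corresponding read sketch, again through the FileSystem API (same assumed path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() fetches block locations from the NameNode; reads then go
    // directly to the closest DataNode holding each block.
    try (FSDataInputStream in = fs.open(new Path("/user/hadoop/demo.txt"))) {
      System.out.println(in.readUTF());
    }
  }
}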
HDFS Overview
Administrative Operations
• Namespace Operations:
– Create: Create a new file or directory in the HDFS namespace.
– Delete: Delete a file or directory from the HDFS namespace.
– Rename: Rename a file or directory.
– List: List files and directories within a directory.
• Metadata Operations:
– The NameNode handles metadata operations, such as
maintaining the namespace and mapping of file blocks to
DataNodes.
– Periodic checkpointing and merging of edit logs with the file
system image to ensure metadata consistency.
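The namespace operations above map directly onto the Java FileSystem API; a minimal sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.mkdirs(new Path("/user/hadoop/reports"));             // create
    fs.rename(new Path("/user/hadoop/reports"),
              new Path("/user/hadoop/archive"));             // rename
    for (FileStatus s : fs.listStatus(new Path("/user/hadoop"))) {
      System.out.println(s.getPath());                       // list
    }
    fs.delete(new Path("/user/hadoop/archive"), true);       // delete (recursive)
  }
}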
HDFS Overview
Administrative Operations
• Replication Management:
– Block Replication: The NameNode monitors block replication and
ensures the configured replication factor is maintained.
– Under-replicated Blocks: If the replication factor of a block falls below
the desired level, the NameNode schedules replication of the block to
other DataNodes.
– Over-replicated Blocks: If a block is over-replicated, the NameNode
removes the extra replicas.
• Data Integrity:
– Checksums: HDFS maintains checksums for data blocks and verifies
them during read/write operations to ensure data integrity.
– Block Reports: DataNodes periodically send block reports to the NameNode; these reports contain information about the blocks they store and help the NameNode maintain an accurate view of the system's block distribution.
HDFS Overview
Administrative Operations
• Heartbeats:
– DataNodes send regular heartbeats to the NameNode
to indicate that they are operational.
– If a DataNode fails to send a heartbeat within a
specified interval, the NameNode marks it as dead
and re-replicates its blocks to other DataNodes.
• Balancing:
– Rebalancer Tool: HDFS provides a rebalancer tool to
redistribute data across DataNodes to ensure a
balanced distribution and optimal performance.
Hadoop and Spark
• Apache Spark is an open-source unified
analytics engine for big data processing, with
built-in modules for SQL, streaming, machine
learning, and graph processing.
• Developed at UC Berkeley's AMPLab, it aims
to provide fast, in-memory data processing.
Hadoop and Spark
Key Components:
• Spark Core: the underlying execution engine; provides task scheduling, memory management, fault recovery, and the RDD abstraction.
• Spark SQL: module for working with structured data through SQL and the DataFrame/Dataset APIs.
• Spark Streaming: enables scalable, fault-tolerant processing of live data streams.
• MLlib (Machine Learning Library): scalable machine learning algorithms and utilities.
• GraphX: API for graphs and graph-parallel computation.
• SparkR: an R frontend for Spark.
Hadoop and Spark
Advantages:
• Speed: In-memory processing is significantly faster than Hadoop MapReduce's disk-based processing.
• Ease of Use: High-level APIs in Java, Scala,
Python, and R.
• Unified Framework: Supports batch processing,
stream processing, machine learning, and graph
processing within a single framework.
• Flexibility: Can be run on various cluster
managers including Hadoop YARN, Apache
Mesos, and Kubernetes, or standalone.
Hadoop and Spark
Disadvantages:
• Memory Consumption: Requires substantial
memory resources, which can be costly.
• Stability: Although mature, the Spark ecosystem is less mature than the Hadoop ecosystem in terms of certain features and tools.
• Complexity in Tuning: Performance tuning can
be complex due to numerous configurations
and memory management requirements.
Hadoop and Spark
Integration of Hadoop and Spark
Complementary Use:
• Storage: HDFS is often used as the storage layer due to its robust, fault-tolerant distributed storage capabilities.
• Processing: Spark can be used as the processing engine due to its speed and efficiency for in-memory processing.
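A minimal sketch of this pairing using Spark's Java API (class name, HDFS input path, and local master are assumptions for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsLineCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("HdfsLineCount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // HDFS provides the storage layer; Spark reads the file and
      // processes it in memory.
      JavaRDD<String> lines = sc.textFile("hdfs://localhost/user/hadoop/input.txt");
      System.out.println("Line count: " + lines.count());
    }
  }
}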
Map Reduce Programming Basics
There are two daemons associated with MapReduce programming:
• Job Tracker
• Task Tracker
Map Reduce Programming Basics
Job Tracker: The JobTracker is the master daemon responsible for executing a MapReduce job. It provides connectivity between Hadoop and the application.
Map Reduce Programming Basics
TaskTracker: This daemon is responsible for executing the individual tasks assigned to it by the JobTracker.
• The TaskTracker continuously sends heartbeat messages to the JobTracker.
• When the JobTracker fails to receive a heartbeat message from a TaskTracker, it assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.
MapReduce Programming Architecture
A MapReduce program in Java requires three classes:
• 1. Driver class: specifies the job configuration details.
• 2. Mapper class: overrides the map function based on the problem statement.
• 3. Reducer class: overrides the reduce function based on the problem statement.
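Putting the three classes together, a minimal word-count sketch using the standard Hadoop Java API (input and output paths are supplied as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper class: overrides map() for the problem statement (word count).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit intermediate <word, 1> pairs
      }
    }
  }

  // Reducer class: overrides reduce() to sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver class: specifies the job configuration details.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the combiner acts as a local reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job would typically be packaged into a jar and launched with something like: hadoop jar wordcount.jar WordCount input output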
Analyzing the data with Hadoop
The map task takes care of loading, parsing, transforming and filtering. The responsibility of the reduce task is grouping and aggregating the data produced by the map tasks to generate the final output. Each map task is broken down into the following phases:
1. Record Reader
2. Mapper
3. Combiner
4. Partitioner
Analyzing the data with Hadoop
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and presents it to the mapper tasks as <key, value> pairs.
i) InputFormat: reads the given input file and computes input splits using the method getSplits().
ii) It then defines a RecordReader using createRecordReader(), which is responsible for generating the <key, value> pairs.
Analyzing the data with Hadoop
2. Mapper: the map function works on the <key, value> pairs produced by the RecordReader and generates intermediate <key, value> pairs.
Methods:
- protected void map(KEYIN key, VALUEIN value, Context context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over the execution of the Mapper.
Analyzing the data with Hadoop
3. Combiner: takes the intermediate pairs provided by a mapper and applies a user-specified aggregation function to the output of that single mapper. It is also known as a local reducer.
Analyzing the data with Hadoop
4. Partitioner: takes the intermediate <key, value> pairs produced by the mapper and splits them into partitions using a user-defined condition.
• The default behavior is to hash the key to determine the reducer. The user can control this by overriding the method:
• int getPartition(KEY key, VALUE value, int numPartitions)
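A hypothetical custom partitioner sketch (the word-length routing rule is invented purely for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes short words to reducer 0 and hashes longer words over the rest.
public class WordLengthPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1 || key.toString().length() <= 4) {
      return 0;
    }
    // Mask off the sign bit so the modulo result is non-negative.
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

It would be enabled in the driver with job.setPartitionerClass(WordLengthPartitioner.class).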
Java Map Reduce
• MapReduce is a processing technique and a programming model for distributed computing based on Java.
• The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
Java Map Reduce
• A MapReduce program executes in two stages:
– Map stage − The map or mapper’s job is to process
the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the
mapper function line by line.
– Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set
of output, which will be stored in the HDFS.
Java Map Reduce
Terminologies:
• PayLoad − Applications implement the Map
and the Reduce functions, and form the core of
the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − The node that manages the Hadoop Distributed File System (HDFS).
• DataNode − A node where the data resides before any processing takes place.
Java Map Reduce
Terminologies:
• MasterNode − Node where JobTracker runs
and which accepts job requests from clients.
• SlaveNode − Node where Map and Reduce
program runs.
• JobTracker − Schedules jobs and tracks the assigned jobs with the TaskTracker.
• Task Attempt − A particular instance of an
attempt to execute a task on a SlaveNode.
Java Map Reduce
Terminologies:
• Task Tracker − Tracks the task and reports
status to JobTracker.
• Job − An execution of a Mapper and a Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer
on a slice of data.
Developing Map Reduce Application
Typical application domains for MapReduce include:
• Social networks
• Media and Entertainment
• Health Care
• Business
• Banking
• Stock Market
• Weather Forecasting