1. Velammal College of Engineering and Technology
(Autonomous)
Department of Information Technology
21IT401
BIG DATA ENGINEERING
2. Syllabus
UNIT II - MAP REDUCE
HDFS Overview, Hadoop and Spark,
Map Reduce Programming Basics,
Analyzing the data with Hadoop: Java MapReduce -
Developing Map Reduce Application.
3. HDFS Overview
• HDFS is a filesystem designed for storing very
large files with streaming data access patterns,
running on clusters of commodity hardware.
• HDFS is built around the idea that the most
efficient data processing pattern is a write-
once, read-many-times pattern.
4. HDFS Overview
Blocks
A disk has a block size, which is the minimum amount
of data that it can read or write. File systems for a
single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size.
Benefits:
• A file can be larger than any single disk in the network.
• Making the unit of abstraction a block rather
than a file simplifies the storage subsystem.
5. HDFS Overview
Name nodes and Data nodes:
• An HDFS cluster has two types of nodes: a name node
(the master) and a number of data nodes (workers).
• The name node manages the file system namespace. It
maintains the file system tree and the metadata for all
the files and directories in the tree.
• This information is stored persistently on the local disk
in the form of two files: the namespace image and the
edit log. The name node also knows the data nodes on
which all the blocks for a given file are located.
6. HDFS Overview
The Command-Line Interface:
• There are many other interfaces to HDFS, but the
command line is one of the simplest, and to many
developers the most familiar.
• There are two properties that we set in the pseudo-
distributed configuration
• The first is fs.default.name, set to hdfs://localhost/,
which is used to set a default file system for Hadoop.
• The second property, dfs.replication, is set to one so
that HDFS doesn't replicate filesystem blocks by the
usual default of three.
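As an illustration (not part of the original slides), these two properties normally live in core-site.xml and hdfs-site.xml; a Java client could equivalently set them on a Configuration object, roughly as sketched below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PseudoDistributedConf {
  public static void main(String[] args) throws Exception {
    // These values are normally set in core-site.xml / hdfs-site.xml;
    // they are set programmatically here only to show what they mean.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost/"); // default filesystem for Hadoop (fs.defaultFS in newer releases)
    conf.set("dfs.replication", "1");                 // one replica instead of the usual default of three
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
  }
}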
7. HDFS Overview
Basic Filesystem Operations:
• HDFS (Hadoop Distributed File System)
operations involve various tasks related to
storing, accessing, and managing data within
the distributed file system. These operations
can be broadly classified into write, read, and
administrative operations.
8. HDFS Overview
Write Operation
• File Write Request:
– A client initiates a request to write a file to HDFS.
• Block Allocation:
– The client contacts the NameNode to get a list of DataNodes for each block of the file.
– The NameNode returns a list of DataNodes for each block, ensuring the replication factor is
maintained.
• Block Writing:
– The client starts writing the file in chunks (blocks) to the first DataNode in the list.
– The first DataNode forwards the block to the next DataNode in the list (pipeline fashion), until
the replication factor is met.
• Acknowledgement:
– Each DataNode in the pipeline sends an acknowledgement back to the client once it has
received the block.
– Once all DataNodes have acknowledged the block, the client proceeds to the next block.
• Commit:
– After writing all blocks, the client notifies the NameNode that the file writing is complete.
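A minimal client-side sketch of this write path, assuming a running HDFS and using /user/hadoop/demo.txt purely as an example path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/hadoop/demo.txt");   // example path

    // create() asks the NameNode to allocate blocks; the returned stream
    // writes each block through the DataNode pipeline described above.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, HDFS");
    }
    // Closing the stream completes the file: the client notifies the NameNode.
  }
}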
9. HDFS Overview
Read Operation
• File Read Request:
– A client initiates a request to read a file from HDFS.
• Block Location Retrieval:
– The client contacts the NameNode to get the locations of
the blocks of the file.
– The NameNode provides the list of DataNodes that have
the blocks.
• Block Reading:
– The client reads the blocks directly from the DataNodes.
– The client can choose the closest DataNode to optimize
read performance.
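A matching read sketch (same assumptions and example path as the write sketch above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hadoop/demo.txt");   // example path

    // open() fetches the block locations from the NameNode; the stream then
    // reads each block directly from a (preferably nearby) DataNode.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false); // copy file contents to stdout
    }
  }
}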
10. HDFS Overview
Administrative Operations
• Namespace Operations:
– Create: Create a new file or directory in the HDFS namespace.
– Delete: Delete a file or directory from the HDFS namespace.
– Rename: Rename a file or directory.
– List: List files and directories within a directory.
• Metadata Operations:
– The NameNode handles metadata operations, such as
maintaining the namespace and mapping of file blocks to
DataNodes.
– Periodic checkpointing and merging of edit logs with the file
system image to ensure metadata consistency.
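The namespace operations listed above (create, delete, rename, list) can be driven from the Java FileSystem API; the following is an illustrative sketch with example paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    fs.mkdirs(new Path("/user/hadoop/reports"));                       // create a directory
    fs.rename(new Path("/user/hadoop/demo.txt"),
              new Path("/user/hadoop/reports/demo.txt"));              // rename / move
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop/reports"))) {
      System.out.println(status.getPath() + "  " + status.getLen());   // list directory contents
    }
    fs.delete(new Path("/user/hadoop/reports"), true);                 // delete recursively
  }
}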
11. HDFS Overview
Administrative Operations
• Replication Management:
– Block Replication: The NameNode monitors block replication and
ensures the configured replication factor is maintained.
– Under-replicated Blocks: If the replication factor of a block falls below
the desired level, the NameNode schedules replication of the block to
other DataNodes.
– Over-replicated Blocks: If a block is over-replicated, the NameNode
removes the extra replicas.
• Data Integrity:
– Checksums: HDFS maintains checksums for data blocks and verifies
them during read/write operations to ensure data integrity.
– Block Reports: DataNodes periodically send block reports to the
NameNode, which contains information about the blocks they store.
This helps the NameNode maintain an accurate view of the system’s
block distribution.
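While replication is managed automatically by the NameNode as described above, a client can also request a different replication factor for an individual file; a small sketch (example path, target factor of 3):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hadoop/reports/demo.txt"); // example path

    // Request a new target replication factor; the NameNode then schedules
    // extra copies (or removes surplus replicas) to converge on this value.
    boolean accepted = fs.setReplication(file, (short) 3);
    System.out.println("Replication change accepted: " + accepted);
  }
}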
12. HDFS Overview
Administrative Operations
• Heartbeats:
– DataNodes send regular heartbeats to the NameNode
to indicate that they are operational.
– If a DataNode fails to send a heartbeat within a
specified interval, the NameNode marks it as dead
and re-replicates its blocks to other DataNodes.
• Balancing:
– Rebalancer Tool: HDFS provides a rebalancer tool to
redistribute data across DataNodes to ensure a
balanced distribution and optimal performance.
13. Hadoop and Spark
• Apache Spark is an open-source unified
analytics engine for big data processing, with
built-in modules for SQL, streaming, machine
learning, and graph processing.
• Developed at UC Berkeley's AMPLab, it aims
to provide fast, in-memory data processing.
14. Hadoop and Spark
Key Components:
• Spark Core: the underlying execution engine, providing task scheduling, memory management, and the RDD abstraction.
• Spark SQL: module for working with structured data using SQL and DataFrames.
• Spark Streaming: processing of live data streams as a series of small batches.
• MLlib (Machine Learning Library): scalable machine learning algorithms and utilities.
• GraphX: API for graphs and graph-parallel computation.
• SparkR: an R package that provides a lightweight frontend to Spark from R.
15. Hadoop and Spark
Advantages:
• Speed: In-memory processing significantly faster
than Hadoop's disk-based processing.
• Ease of Use: High-level APIs in Java, Scala,
Python, and R.
• Unified Framework: Supports batch processing,
stream processing, machine learning, and graph
processing within a single framework.
• Flexibility: Can be run on various cluster
managers including Hadoop YARN, Apache
Mesos, and Kubernetes, or standalone.
16. Hadoop and Spark
Disadvantages:
• Memory Consumption: Requires substantial
memory resources, which can be costly.
• Stability and maturity: while Spark itself is mature, its
ecosystem is less mature than Hadoop's in terms of certain
features and tools.
• Complexity in Tuning: Performance tuning can
be complex due to numerous configurations
and memory management requirements.
17. Hadoop and Spark
Integration of Hadoop and Spark
• Complementary Use:
• Storage: HDFS is often used as the storage
layer due to its robust, fault-tolerant
distributed storage capabilities.
• Processing: Spark can be used as the
processing engine due to its speed and
efficiency for in-memory processing.
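As a rough illustration of this split (HDFS for storage, Spark for processing), a minimal Java Spark job that reads a file from HDFS might look like the sketch below; the application name and input path are examples only, and the cluster manager (e.g. YARN) would normally be chosen when submitting the job with spark-submit.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsLineCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("HdfsLineCount"); // master set via spark-submit
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // HDFS provides the storage layer; Spark processes the data in memory.
      JavaRDD<String> lines = sc.textFile("hdfs://localhost/user/hadoop/input.txt");
      System.out.println("Line count: " + lines.count());
    }
  }
}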
18. Map Reduce Programming Basics
There are two daemons associated with Map
Reduce Programming:
• Job Tracker
• Task Tracker
19. Map Reduce Programming Basics
Job Tracker: Job Tracker is the master daemon
responsible for executing MapReduce jobs.
It provides connectivity between Hadoop
and the application.
21. Map Reduce Programming Basics
TaskTracker: This is responsible for executing the
individual tasks assigned to it by the Job Tracker.
• Task Tracker continuously sends heartbeat
message to job tracker.
• When the JobTracker fails to receive a heartbeat
message from a TaskTracker, it assumes that the
TaskTracker has failed and resubmits the task to
another available node in the cluster.
23. Map Reduce Programming Basics
A MapReduce program written in Java requires
three classes:
• 1. Driver Class: specifies the job configuration
details (a minimal driver sketch follows this list).
• 2. Mapper Class: overrides the map function based
on the problem statement.
• 3. Reducer Class: overrides the reduce function
based on the problem statement.
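A minimal driver sketch for a word-count style job; WordCountMapper and WordCountReducer refer to the illustrative Mapper and Reducer sketches later in this unit and are not a fixed API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // Driver class: holds the job configuration details.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);      // Mapper class (sketched later)
    job.setCombinerClass(WordCountReducer.class);   // optional local reducer
    job.setReducerClass(WordCountReducer.class);    // Reducer class (sketched later)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}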
24. Analyzing the data with Hadoop
The map task takes care of loading, parsing,
transforming and filtering the input. The responsibility
of the reduce task is grouping and aggregating the data
produced by the map tasks to generate the final
output. Each map task is broken down into the
following phases:
1. Record Reader
2. Mapper
3. Combiner
4. Partitioner
25. Analyzing the data with Hadoop
1. RecordReader: converts the byte-oriented view of
the input into a record-oriented view and presents
it to the mapper tasks as keys and values.
i) InputFormat: reads the given input file
and splits it using the method getSplits().
ii) It then defines a RecordReader using
createRecordReader(), which is responsible for
generating <key, value> pairs.
26. Analyzing the data with Hadoop
2. Mapper: the map function works on the <key,
value> pairs produced by the RecordReader and
generates intermediate (key, value) pairs.
Methods:
- protected void map(KEYIN key, VALUEIN
value, Context context): called once for each
key-value pair in the input split.
- void run(Context context): the user can override
this method for complete control over the
execution of the Mapper.
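A word-count style Mapper consistent with the driver sketch earlier (illustrative only):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per <key, value> pair supplied by the RecordReader:
    // key = byte offset of the line, value = the line itself.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit an intermediate (word, 1) pair
    }
  }
}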
27. Analyzing the data with Hadoop
3. Combiner: takes the intermediate pairs produced
by a single mapper and applies a user-defined
aggregation function to them before the shuffle.
It is also known as a local reducer.
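In the driver sketch shown earlier, this corresponds to the optional call job.setCombinerClass(WordCountReducer.class): the same reducer logic is applied locally to each mapper's output before the shuffle, which typically reduces the volume of intermediate data transferred across the network.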
28. Analyzing the data with Hadoop
4. Partitioner: takes the intermediate <key, value>
pairs produced by the mapper and splits them
into partitions using a user-defined condition.
• The default behavior is to hash the key to
determine the reducer. The user can control
partitioning by overriding the method:
• int getPartition(KEY key, VALUE value, int
numPartitions)
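A minimal custom Partitioner overriding getPartition(); the routing rule here (first character of the key) is only an example, whereas the default HashPartitioner hashes the whole key:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    if (k.isEmpty()) {
      return 0;                                   // route empty keys to the first reducer
    }
    // Route keys to reducers by their first character.
    return Character.toLowerCase(k.charAt(0)) % numPartitions;
  }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).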
29. Java Map Reduce
• MapReduce is a processing technique and a
programming model for distributed computing
based on Java.
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into
another set of data, where individual
elements are broken down into tuples
(key/value pairs).
30. Java Map Reduce
• A MapReduce program executes in two stages:
– Map stage − The map or mapper’s job is to process
the input data. Generally the input data is in the
form of a file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the
mapper function line by line.
– Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set
of output, which will be stored in the HDFS.
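A matching Reducer for the word-count sketches above (it also serves as the combiner in the driver sketch):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // After the shuffle, all counts for one word arrive together;
    // summing them produces the final (word, total) output written to HDFS.
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}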
32. Java Map Reduce
Terminologies:
• PayLoad − Applications implement the Map
and the Reduce functions, and form the core of
the job.
• Mapper − Mapper maps the input key/value
pairs to a set of intermediate key/value pair.
• NameNode − Node that manages the Hadoop
Distributed File System (HDFS).
• DataNode − Node where the data resides before
any processing takes place.
33. Java Map Reduce
Terminologies:
• MasterNode − Node where JobTracker runs
and which accepts job requests from clients.
• SlaveNode − Node where Map and Reduce
program runs.
• JobTracker − Schedules jobs and tracks the
assigned jobs with the Task Tracker.
• Task Attempt − A particular instance of an
attempt to execute a task on a SlaveNode.
34. Java Map Reduce
Terminologies:
• Task Tracker − Tracks the task and reports
status to JobTracker.
• Job − An execution of a Mapper and Reducer
across a dataset.
• Task − An execution of a Mapper or a Reducer
on a slice of data.
35. Developing Map Reduce Application
• Social networks
• Media and Entertainment
• Health Care
• Business
• Banking
• Stock Market
• Weather Forecasting