Velammal College of Engineering and Technology
(Autonomous)
Department of Information Technology
21IT401
BIG DATA ENGINEERING
Syllabus
UNIT II - MAP REDUCE
HDFS Overview, Hadoop and Spark, Map Reduce
Programming Basics, Analyzing the data with
Hadoop: Java MapReduce - Developing Map Reduce
Application - Running Locally on Test Data -
Running on a Cluster - MapReduce Workflow
HDFS Overview
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• HDFS is built around the idea that the most
efficient data processing pattern is a write-
once, read-many-times pattern.
HDFS Overview
Blocks
A disk has a block size, which is the minimum amount
of data that it can read or write. File systems for a
single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size.
Benefits:
• A file can be larger than any single disk in the network.
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
HDFS Overview
Name nodes and Data nodes:
• A name node (the master) and a number of data nodes
(workers).
• The name node manages the file system namespace. It
maintains the file system tree and the metadata for all
the files and directories in the tree.
• This information is stored persistently on the local disk
in the form of two files: the namespace image and the
edit log. The name node also knows the data nodes on
which all the blocks for a given file are located.
HDFS Overview
The Command-Line Interface:
• There are many other interfaces to HDFS, but the
command line is one of the simplest, and to many
developers the most familiar.
• There are two properties that we set in the pseudo-distributed configuration.
• The first is fs.default.name, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop.
• The second property, dfs.replication, is set to 1 so that HDFS doesn't replicate filesystem blocks by the usual default of three.
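As a sketch, these two properties would live in Hadoop's core-site.xml and hdfs-site.xml configuration files (the exact configuration directory depends on the installation; newer Hadoop releases name the first property fs.defaultFS):

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>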
HDFS Overview
Basic Filesystem Operations:
• HDFS (Hadoop Distributed File System)
operations involve various tasks related to
storing, accessing, and managing data within
the distributed file system. These operations
can be broadly classified into write, read, and
administrative operations.
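A few representative command-line operations against HDFS (the paths and file names are illustrative):

% hadoop fs -mkdir /user/hadoop/input                      # create a directory
% hadoop fs -copyFromLocal data.txt /user/hadoop/input     # write: copy a local file into HDFS
% hadoop fs -ls /user/hadoop/input                         # list the directory
% hadoop fs -cat /user/hadoop/input/data.txt               # read the file back
% hadoop fs -rm /user/hadoop/input/data.txt                # delete the file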
HDFS Overview
Write Operation
• File Write Request:
– A client initiates a request to write a file to HDFS.
• Block Allocation:
– The client contacts the NameNode to get a list of DataNodes for each block of the file.
– The NameNode returns a list of DataNodes for each block, ensuring the replication factor is
maintained.
• Block Writing:
– The client starts writing the file in chunks (blocks) to the first DataNode in the list.
– The first DataNode forwards the block to the next DataNode in the list (pipeline fashion), until
the replication factor is met.
• Acknowledgement:
– Each DataNode in the pipeline sends an acknowledgement back to the client once it has
received the block.
– Once all DataNodes have acknowledged the block, the client proceeds to the next block.
• Commit:
– After writing all blocks, the client notifies the NameNode that the file writing is complete.
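A minimal write sketch using the Java FileSystem API (the path and contents are illustrative); the block allocation and DataNode pipeline described above are handled inside the client library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // client handle to the configured filesystem
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo.txt")); // NameNode allocates blocks
        out.writeUTF("hello hdfs");                           // bytes stream through the DataNode pipeline
        out.close();                                          // closing completes the file write
        fs.close();
    }
}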
HDFS Overview
Read Operation
• File Read Request:
– A client initiates a request to read a file from HDFS.
• Block Location Retrieval:
– The client contacts the NameNode to get the locations of
the blocks of the file.
– The NameNode provides the list of DataNodes that have
the blocks.
• Block Reading:
– The client reads the blocks directly from the DataNodes.
– The client can choose the closest DataNode to optimize
read performance.
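A matching read sketch (same illustrative path); the client fetches block locations from the NameNode and then streams the data directly from DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/hadoop/demo.txt")); // block locations come from the NameNode
        IOUtils.copyBytes(in, System.out, 4096, false);       // data is read directly from the DataNodes
        IOUtils.closeStream(in);
        fs.close();
    }
}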
HDFS Overview
Administrative Operations
• Namespace Operations:
– Create: Create a new file or directory in the HDFS namespace.
– Delete: Delete a file or directory from the HDFS namespace.
– Rename: Rename a file or directory.
– List: List files and directories within a directory.
• Metadata Operations:
– The NameNode handles metadata operations, such as
maintaining the namespace and mapping of file blocks to
DataNodes.
– Periodic checkpointing and merging of edit logs with the file
system image to ensure metadata consistency.
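The namespace operations listed above map directly onto the FileSystem API; a brief sketch with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/user/hadoop/reports"));                                    // Create
        fs.rename(new Path("/user/hadoop/reports"), new Path("/user/hadoop/archive"));  // Rename
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {             // List
            System.out.println(status.getPath());
        }
        fs.delete(new Path("/user/hadoop/archive"), true);                              // Delete (recursive)
        fs.close();
    }
}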
HDFS Overview
Administrative Operations
• Replication Management:
– Block Replication: The NameNode monitors block replication and
ensures the configured replication factor is maintained.
– Under-replicated Blocks: If the replication factor of a block falls below
the desired level, the NameNode schedules replication of the block to
other DataNodes.
– Over-replicated Blocks: If a block is over-replicated, the NameNode
removes the extra replicas.
• Data Integrity:
– Checksums: HDFS maintains checksums for data blocks and verifies
them during read/write operations to ensure data integrity.
– Block Reports: DataNodes periodically send block reports to the NameNode, which contain information about the blocks they store. This helps the NameNode maintain an accurate view of the system’s block distribution.
HDFS Overview
Administrative Operations
• Heartbeats:
– DataNodes send regular heartbeats to the NameNode
to indicate that they are operational.
– If a DataNode fails to send a heartbeat within a
specified interval, the NameNode marks it as dead
and re-replicates its blocks to other DataNodes.
• Balancing:
– Rebalancer Tool: HDFS provides a rebalancer tool to
redistribute data across DataNodes to ensure a
balanced distribution and optimal performance.
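The rebalancer is typically started from the command line, for example (the threshold value is illustrative):

% hdfs balancer -threshold 10    # move blocks until each DataNode's usage is within 10% of the cluster average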
Hadoop and Spark
• Apache Spark is an open-source unified
analytics engine for big data processing, with
built-in modules for SQL, streaming, machine
learning, and graph processing.
• Developed at UC Berkeley's AMPLab, it aims
to provide fast, in-memory data processing.
Hadoop and Spark
Key Components:
• Spark Core: the underlying execution engine, providing task scheduling, memory management, and the RDD abstraction.
• Spark SQL: structured data processing with SQL and the DataFrame/Dataset APIs.
• Spark Streaming: processing of live data streams.
• MLlib (Machine Learning Library): common machine learning algorithms and utilities.
• GraphX: graph processing and graph-parallel computation.
• SparkR: an R front end for Spark.
Hadoop and Spark
Advantages:
• Speed: In-memory processing is significantly faster than Hadoop MapReduce's disk-based processing.
• Ease of Use: High-level APIs in Java, Scala,
Python, and R.
• Unified Framework: Supports batch processing,
stream processing, machine learning, and graph
processing within a single framework.
• Flexibility: Can be run on various cluster
managers including Hadoop YARN, Apache
Mesos, and Kubernetes, or standalone.
Hadoop and Spark
Disadvantages:
• Memory Consumption: Requires substantial
memory resources, which can be costly.
• Maturity: Although mature, the Spark ecosystem is less mature than the Hadoop ecosystem in terms of certain features and tools.
• Complexity in Tuning: Performance tuning can
be complex due to numerous configurations
and memory management requirements.
Hadoop and Spark
Integration of Hadoop and Spark
• Complementary Use:
• Storage: HDFS is often used as the storage
layer due to its robust, fault-tolerant
distributed storage capabilities.
• Processing: Spark can be used as the
processing engine due to its speed and
efficiency for in-memory processing.
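A minimal sketch of this complementary pattern in Java, assuming a Spark installation that can reach the HDFS NameNode (the host and path are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class HdfsPlusSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("HdfsPlusSpark")
                .getOrCreate();                      // cluster manager (YARN, standalone, ...) chosen at submit time
        // HDFS is the storage layer; Spark performs the in-memory processing.
        Dataset<String> lines = spark.read().textFile("hdfs://localhost/user/hadoop/input/data.txt");
        System.out.println("line count = " + lines.count());
        spark.stop();
    }
}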
Map Reduce Programming Basics
There are two daemons associated with Map
Reduce Programming:
• Job Tracker
• Task Tracker
Map Reduce Programming Basics
Job Tracker: The Job Tracker is a master daemon responsible for overseeing the execution of MapReduce jobs. It provides connectivity between Hadoop and the application.
Map Reduce Programming Basics
Task Tracker: This is responsible for executing the individual tasks that are assigned by the Job Tracker.
• The Task Tracker continuously sends heartbeat messages to the Job Tracker.
• When the Job Tracker fails to receive a heartbeat message from a Task Tracker, it assumes that the Task Tracker has failed and resubmits the task to another available node in the cluster.
MapReduce Programming
Architecture
A MapReduce program in Java requires three classes (see the sketch below):
• 1. Driver Class: This class specifies the job configuration details.
• 2. Mapper Class: This class overrides the map function based on the problem statement.
• 3. Reducer Class: This class overrides the reduce function based on the problem statement.
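A minimal word-count sketch showing how the three classes fit together; this follows the standard Hadoop WordCount example (the input and output paths are taken from the command line), not a program specific to these slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: overrides the map function for the word-count problem.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);           // emit (word, 1)
            }
        }
    }

    // Reducer class: overrides the reduce function to sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);             // emit (word, total)
        }
    }

    // Driver class: specifies the job configuration details.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner acts as a local reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}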
Analyzing the data with Hadoop
The map task takes care of loading, parsing, transforming and filtering. The responsibility of the reduce task is grouping and aggregating the data produced by the map tasks to generate the final output. Each map task is broken down into the following phases:
1. Record Reader
2. Mapper
3. Combiner
4. Partitioner
Analyzing the data with Hadoop
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper tasks as <key, value> pairs.
i) InputFormat: reads the given input file and splits it using the method getSplits().
ii) It then defines a RecordReader using createRecordReader(), which is responsible for generating <key, value> pairs.
Analyzing the data with Hadoop
2. Mapper: The map function works on the <key, value> pairs produced by the RecordReader and generates intermediate (key, value) pairs.
Methods:
- protected void map(KEYIN key, VALUEIN value, Context context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over the execution of the Mapper.
Analyzing the data with Hadoop
3. Combiner: It takes the intermediate pairs produced by a single mapper and applies a user-specified aggregation function to them. It is also known as a local Reducer.
Analyzing the data with Hadoop
4. Partitioner: Takes the intermediate <key, value> pairs produced by the mapper and splits them into partitions using a user-defined condition.
• The default behavior is to hash the key to determine the reducer. The user can control this by overriding the method (see the sketch below):
• int getPartition(KEY key, VALUE value, int numPartitions)
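A sketch of a user-defined Partitioner with a hypothetical condition (keys beginning with a–m go to the first partition, the rest are hashed over the remaining ones); it would be registered in the driver with job.setPartitionerClass:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;                                   // only one reducer: everything goes to partition 0
        }
        String k = key.toString();
        char first = k.isEmpty() ? 'z' : Character.toLowerCase(k.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;                                   // keys starting with a-m go to the first reducer
        }
        // everything else is spread over the remaining reducers by hash
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
// In the driver: job.setPartitionerClass(AlphabetPartitioner.class);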
Java Map Reduce
• MapReduce is a processing technique and a programming model for distributed computing based on Java.
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into
another set of data, where individual
elements are broken down into tuples
(key/value pairs).
Java Map Reduce
• A MapReduce program executes in two stages:
– Map stage − The map or mapper’s job is to process
the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the
mapper function line by line.
– Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set
of output, which will be stored in the HDFS.
Java Map Reduce
Terminologies:
• PayLoad − Applications implement the Map
and the Reduce functions, and form the core of
the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where the data resides before any processing takes place.
Java Map Reduce
Terminologies:
• MasterNode − Node where JobTracker runs
and which accepts job requests from clients.
• SlaveNode − Node where Map and Reduce
program runs.
• JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
• Task Attempt − A particular instance of an
attempt to execute a task on a SlaveNode.
Java Map Reduce
Terminologies:
• Task Tracker − Tracks the task and reports
status to JobTracker.
• Job − An execution of a Mapper and a Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer
on a slice of data.
Developing Map Reduce
Application
• Social networks
• Media and Entertainment
• Health Care
• Business
• Banking
• Stock Market
• Weather Forecasting
Running Locally on Test Data
Why Run Locally?
• Faster feedback loop
• Lower resource cost
• Easier debugging
• Safe from data corruption
Common Tools for Local Testing
• Apache Spark (local mode)
• Hadoop (pseudo-distributed mode)
• Docker containers
• Python notebooks (PySpark, Pandas for mock data)
Running Locally on Test Data
Steps to Run Locally
• Prepare sample test data (CSV/JSON/Parquet)
• Load test data into local file system
• Run code on local execution engine (e.g., Spark local)
• Validate results/output
• Debug and iterate
Example – PySpark Local Testing
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("TestRun").getOrCreate()
df = spark.read.csv("test_data.csv", header=True)
df.show()
• Output: View result locally and check transformations
Running Locally on Test Data
Challenges
• Small data may not show scale issues
• Local environment may lack some distributed
configs
• Not suitable for performance testing
Running on a Cluster
Running on a cluster enables parallel processing
of large-scale data across multiple machines.
Why Run on a Cluster?
- Handle massive datasets
- Distributed computing power
- Fault tolerance
- Scalability and performance
Cluster Components
- Master Node: Job coordination, resource
management
- Worker Nodes: Execute tasks, store data
- Distributed File System: HDFS, Amazon S3
Popular Cluster Frameworks
- Apache Hadoop (MapReduce)
- Apache Spark
- Apache Flink
- Kubernetes for container orchestration
Execution Workflow
1. Submit job to master
2. Job is split into tasks
3. Tasks are assigned to worker nodes
4. Output is collected and written to storage
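For example, a packaged job can be submitted to the cluster from the command line (the jar names, class names, and paths are illustrative; WordCount and HdfsPlusSpark refer to the sketches shown earlier):

% hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output   # MapReduce job on YARN
% spark-submit --master yarn --class HdfsPlusSpark spark-job.jar              # Spark job on YARN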
Monitoring & Logs
- Use tools like YARN Resource Manager UI,
Spark UI
- Logs can be fetched from master or centralized
log systems
Challenges
- Debugging is harder
- Cluster resource contention
- Network latency and node failures
MapReduce Workflow Example
Problem: Count word frequency
1. Map: Emit (word, 1)
2. Shuffle/Sort: Group by word
3. Reduce: Sum values for each word
MapReduce Code Structure
Mapper:
    def map(key, value):
        for word in value.split():
            emit(word, 1)
Reducer:
    def reduce(key, values):
        emit(key, sum(values))
Workflow with Hadoop
- Input data in HDFS
- MapReduce job submitted to YARN
- Data split into blocks
- Mapper and Reducer tasks executed
- Output written back to HDFS
Chaining MapReduce Jobs
- Output of one job becomes input of the next
- Enables multi-stage data processing
- Example: Cleaning → Transforming →
Aggregating
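A minimal driver-level chaining sketch (job names, paths, and stage logic are illustrative; each stage would set its own mapper and reducer classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/user/hadoop/tmp/stage1");   // illustrative intermediate path

        // Stage 1: cleaning job writes to the intermediate directory.
        Job cleaning = Job.getInstance(conf, "cleaning");
        cleaning.setJarByClass(ChainedJobs.class);
        // cleaning.setMapperClass(...); cleaning.setReducerClass(...);  // problem-specific classes
        FileInputFormat.addInputPath(cleaning, new Path("/user/hadoop/raw"));
        FileOutputFormat.setOutputPath(cleaning, intermediate);
        if (!cleaning.waitForCompletion(true)) {
            System.exit(1);                                         // stop the chain if stage 1 fails
        }

        // Stage 2: aggregating job reads what stage 1 produced.
        Job aggregating = Job.getInstance(conf, "aggregating");
        aggregating.setJarByClass(ChainedJobs.class);
        FileInputFormat.addInputPath(aggregating, intermediate);    // output of job 1 becomes input of job 2
        FileOutputFormat.setOutputPath(aggregating, new Path("/user/hadoop/final"));
        System.exit(aggregating.waitForCompletion(true) ? 0 : 1);
    }
}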
Tools and Frameworks
- Hadoop MapReduce
- Apache Pig (high-level abstraction)
- Apache Hive (SQL-like queries using
MapReduce)
- Cascading, Crunch
Challenges
- High latency for iterative tasks
- Complex debugging
- Manual optimization required
