Velammal College of Engineering and Technology
(Autonomous)
Department of Information Technology
21IT401
BIG DATA ENGINEERING
Syllabus
UNIT II - MAP REDUCE
HDFS Overview, Hadoop and Spark, Map Reduce
Programming Basics, Analyzing the data with
Hadoop: Java MapReduce - Developing Map Reduce
Application - Running Locally on Test Data -
Running on a Cluster - MapReduce Workflow
HDFS Overview
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• HDFS is built around the idea that the most
efficient data processing pattern is a write-
once, read-many-times pattern.
HDFS Overview
Blocks
A disk has a block size, which is the minimum amount
of data that it can read or write. File systems for a
single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size.
Benefits:
• A file can be larger than any single disk in the network.
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
HDFS Overview
Name nodes and Data nodes:
• A name node (the master) and a number of data nodes
(workers).
• The name node manages the file system namespace. It
maintains the file system tree and the metadata for all
the files and directories in the tree.
• This information is stored persistently on the local disk
in the form of two files: the namespace image and the
edit log. The name node also knows the data nodes on
which all the blocks for a given file are located.
HDFS Overview
The Command-Line Interface:
• There are many other interfaces to HDFS, but the
command line is one of the simplest, and to many
developers the most familiar.
• There are two properties that we set in the pseudo-distributed configuration.
• The first is fs.default.name, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop.
• The second property, dfs.replication, is set to 1 so that HDFS doesn't replicate filesystem blocks by the usual default of three.
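As a sketch, these two properties would live in Hadoop's core-site.xml and hdfs-site.xml configuration files (the exact configuration directory depends on the installation; newer Hadoop releases name the first property fs.defaultFS):

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>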
HDFS Overview
Basic Filesystem Operations:
• HDFS (Hadoop Distributed File System)
operations involve various tasks related to
storing, accessing, and managing data within
the distributed file system. These operations
can be broadly classified into write, read, and
administrative operations.
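A few representative command-line operations against HDFS (the paths and file names are illustrative):

% hadoop fs -mkdir /user/hadoop/input                      # create a directory
% hadoop fs -copyFromLocal data.txt /user/hadoop/input     # write: copy a local file into HDFS
% hadoop fs -ls /user/hadoop/input                         # list the directory
% hadoop fs -cat /user/hadoop/input/data.txt               # read the file back
% hadoop fs -rm /user/hadoop/input/data.txt                # delete the file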
HDFS Overview
Write Operation
• File Write Request:
– A client initiates a request to write a file to HDFS.
• Block Allocation:
– The client contacts the NameNode to get a list of DataNodes for each block of the file.
– The NameNode returns a list of DataNodes for each block, ensuring the replication factor is
maintained.
• Block Writing:
– The client starts writing the file in chunks (blocks) to the first DataNode in the list.
– The first DataNode forwards the block to the next DataNode in the list (pipeline fashion), until
the replication factor is met.
• Acknowledgement:
– Each DataNode in the pipeline sends an acknowledgement back to the client once it has
received the block.
– Once all DataNodes have acknowledged the block, the client proceeds to the next block.
• Commit:
– After writing all blocks, the client notifies the NameNode that the file writing is complete.
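A minimal write sketch using the Java FileSystem API (the path and contents are illustrative); the block allocation and DataNode pipeline described above are handled inside the client library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // client handle to the configured filesystem
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo.txt")); // NameNode allocates blocks
        out.writeUTF("hello hdfs");                           // bytes stream through the DataNode pipeline
        out.close();                                          // closing completes the file write
        fs.close();
    }
}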
HDFS Overview
Read Operation
• File Read Request:
– A client initiates a request to read a file from HDFS.
• Block Location Retrieval:
– The client contacts the NameNode to get the locations of
the blocks of the file.
– The NameNode provides the list of DataNodes that have
the blocks.
• Block Reading:
– The client reads the blocks directly from the DataNodes.
– The client can choose the closest DataNode to optimize
read performance.
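A matching read sketch (same illustrative path); the client fetches block locations from the NameNode and then streams the data directly from DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/hadoop/demo.txt")); // block locations come from the NameNode
        IOUtils.copyBytes(in, System.out, 4096, false);       // data is read directly from the DataNodes
        IOUtils.closeStream(in);
        fs.close();
    }
}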
HDFS Overview
Administrative Operations
• Namespace Operations:
– Create: Create a new file or directory in the HDFS namespace.
– Delete: Delete a file or directory from the HDFS namespace.
– Rename: Rename a file or directory.
– List: List files and directories within a directory.
• Metadata Operations:
– The NameNode handles metadata operations, such as
maintaining the namespace and mapping of file blocks to
DataNodes.
– Periodic checkpointing and merging of edit logs with the file
system image to ensure metadata consistency.
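The namespace operations listed above map directly onto the FileSystem API; a brief sketch with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/user/hadoop/reports"));                                    // Create
        fs.rename(new Path("/user/hadoop/reports"), new Path("/user/hadoop/archive"));  // Rename
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {             // List
            System.out.println(status.getPath());
        }
        fs.delete(new Path("/user/hadoop/archive"), true);                              // Delete (recursive)
        fs.close();
    }
}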
HDFS Overview
Administrative Operations
• Replication Management:
– Block Replication: The NameNode monitors block replication and
ensures the configured replication factor is maintained.
– Under-replicated Blocks: If the replication factor of a block falls below
the desired level, the NameNode schedules replication of the block to
other DataNodes.
– Over-replicated Blocks: If a block is over-replicated, the NameNode
removes the extra replicas.
• Data Integrity:
– Checksums: HDFS maintains checksums for data blocks and verifies
them during read/write operations to ensure data integrity.
– Block Reports: DataNodes periodically send block reports to the NameNode, which contain information about the blocks they store. This helps the NameNode maintain an accurate view of the system’s block distribution.
HDFS Overview
Administrative Operations
• Heartbeats:
– DataNodes send regular heartbeats to the NameNode
to indicate that they are operational.
– If a DataNode fails to send a heartbeat within a
specified interval, the NameNode marks it as dead
and re-replicates its blocks to other DataNodes.
• Balancing:
– Rebalancer Tool: HDFS provides a rebalancer tool to
redistribute data across DataNodes to ensure a
balanced distribution and optimal performance.
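The rebalancer is typically started from the command line, for example (the threshold value is illustrative):

% hdfs balancer -threshold 10    # move blocks until each DataNode's usage is within 10% of the cluster average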
Hadoop and Spark
• Apache Spark is an open-source unified
analytics engine for big data processing, with
built-in modules for SQL, streaming, machine
learning, and graph processing.
• Developed at UC Berkeley's AMPLab, it aims
to provide fast, in-memory data processing.
Hadoop and Spark
Key Components:
• Spark Core: the underlying execution engine, providing task scheduling, memory management, and the RDD abstraction.
• Spark SQL: structured data processing with SQL and the DataFrame/Dataset APIs.
• Spark Streaming: processing of live data streams.
• MLlib (Machine Learning Library): common machine learning algorithms and utilities.
• GraphX: graph processing and graph-parallel computation.
• SparkR: an R front end for Spark.
Hadoop and Spark
Advantages:
• Speed: In-memory processing is significantly faster than Hadoop MapReduce's disk-based processing.
• Ease of Use: High-level APIs in Java, Scala,
Python, and R.
• Unified Framework: Supports batch processing,
stream processing, machine learning, and graph
processing within a single framework.
• Flexibility: Can be run on various cluster
managers including Hadoop YARN, Apache
Mesos, and Kubernetes, or standalone.
Hadoop and Spark
Disadvantages:
• Memory Consumption: Requires substantial
memory resources, which can be costly.
• Maturity: Although mature, the Spark ecosystem is less mature than the Hadoop ecosystem in terms of certain features and tools.
• Complexity in Tuning: Performance tuning can
be complex due to numerous configurations
and memory management requirements.
Hadoop and Spark
Integration of Hadoop and Spark
• Complementary Use:
• Storage: HDFS is often used as the storage
layer due to its robust, fault-tolerant
distributed storage capabilities.
• Processing: Spark can be used as the
processing engine due to its speed and
efficiency for in-memory processing.
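A minimal sketch of this complementary pattern in Java, assuming a Spark installation that can reach the HDFS NameNode (the host and path are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class HdfsPlusSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("HdfsPlusSpark")
                .getOrCreate();                      // cluster manager (YARN, standalone, ...) chosen at submit time
        // HDFS is the storage layer; Spark performs the in-memory processing.
        Dataset<String> lines = spark.read().textFile("hdfs://localhost/user/hadoop/input/data.txt");
        System.out.println("line count = " + lines.count());
        spark.stop();
    }
}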
Map Reduce Programming Basics
There are two daemons associated with Map
Reduce Programming:
• Job Tracker
• Task Tracker
Map Reduce Programming Basics
Job Tracker: The Job Tracker is a master daemon responsible for overseeing the execution of MapReduce jobs. It provides connectivity between Hadoop and the application.
Map Reduce Programming Basics
Task Tracker: This is responsible for executing the individual tasks that are assigned by the Job Tracker.
• The Task Tracker continuously sends heartbeat messages to the Job Tracker.
• When the Job Tracker fails to receive a heartbeat message from a Task Tracker, it assumes that the Task Tracker has failed and resubmits the task to another available node in the cluster.
MapReduce Programming
Architecture
A MapReduce program in Java requires three classes (see the sketch below):
• 1. Driver Class: This class specifies the job configuration details.
• 2. Mapper Class: This class overrides the map function based on the problem statement.
• 3. Reducer Class: This class overrides the reduce function based on the problem statement.
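A minimal word-count sketch showing how the three classes fit together; this follows the standard Hadoop WordCount example (the input and output paths are taken from the command line), not a program specific to these slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: overrides the map function for the word-count problem.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);           // emit (word, 1)
            }
        }
    }

    // Reducer class: overrides the reduce function to sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);             // emit (word, total)
        }
    }

    // Driver class: specifies the job configuration details.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner acts as a local reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}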
Analyzing the data with Hadoop
The map task takes care of loading, parsing, transforming and filtering. The responsibility of the reduce task is grouping and aggregating the data produced by the map tasks to generate the final output. Each map task is broken down into the following phases:
1. Record Reader
2. Mapper
3. Combiner
4. Partitioner
Analyzing the data with Hadoop
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper tasks as <key, value> pairs.
i) InputFormat: reads the given input file and splits it using the method getSplits().
ii) It then defines a RecordReader using createRecordReader(), which is responsible for generating <key, value> pairs.
Analyzing the data with Hadoop
2. Mapper: The map function works on the <key, value> pairs produced by the RecordReader and generates intermediate (key, value) pairs.
Methods:
- protected void map(KEYIN key, VALUEIN value, Context context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over the execution of the Mapper.
Analyzing the data with Hadoop
3. Combiner: It takes the intermediate pairs produced by a single mapper and applies a user-specified aggregation function to them. It is also known as a local Reducer.
Analyzing the data with Hadoop
4. Partitioner: Takes the intermediate <key, value> pairs produced by the mapper and splits them into partitions using a user-defined condition.
• The default behavior is to hash the key to determine the reducer. The user can control this by overriding the method (see the sketch below):
• int getPartition(KEY key, VALUE value, int numPartitions)
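A sketch of a user-defined Partitioner with a hypothetical condition (keys beginning with a–m go to the first partition, the rest are hashed over the remaining ones); it would be registered in the driver with job.setPartitionerClass:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;                                   // only one reducer: everything goes to partition 0
        }
        String k = key.toString();
        char first = k.isEmpty() ? 'z' : Character.toLowerCase(k.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;                                   // keys starting with a-m go to the first reducer
        }
        // everything else is spread over the remaining reducers by hash
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
// In the driver: job.setPartitionerClass(AlphabetPartitioner.class);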
Java Map Reduce
• MapReduce is a processing technique and a programming model for distributed computing based on Java.
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into
another set of data, where individual
elements are broken down into tuples
(key/value pairs).
Java Map Reduce
• A MapReduce program executes in two stages:
– Map stage − The map or mapper’s job is to process
the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the
mapper function line by line.
– Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set
of output, which will be stored in the HDFS.
Java Map Reduce
Terminologies:
• PayLoad − Applications implement the Map
and the Reduce functions, and form the core of
the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where the data resides before any processing takes place.
Java Map Reduce
Terminologies:
• MasterNode − Node where JobTracker runs
and which accepts job requests from clients.
• SlaveNode − Node where Map and Reduce
program runs.
• JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
• Task Attempt − A particular instance of an
attempt to execute a task on a SlaveNode.
Java Map Reduce
Terminologies:
• Task Tracker − Tracks the task and reports
status to JobTracker.
• Job − An execution of a Mapper and a Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer
on a slice of data.
Developing Map Reduce
Application
• Social networks
• Media and Entertainment
• Health Care
• Business
• Banking
• Stock Market
• Weather Forecasting
Running Locally on Test Data
Why Run Locally?
• Faster feedback loop
• Lower resource cost
• Easier debugging
• Safe from data corruption
Common Tools for Local Testing
• Apache Spark (local mode)
• Hadoop (pseudo-distributed mode)
• Docker containers
• Python notebooks (PySpark, Pandas for mock data)
Running Locally on Test Data
Steps to Run Locally
• Prepare sample test data (CSV/JSON/Parquet)
• Load test data into local file system
• Run code on local execution engine (e.g., Spark local)
• Validate results/output
• Debug and iterate
Example – PySpark Local Testing
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("TestRun").getOrCreate()
df = spark.read.csv("test_data.csv", header=True)
df.show()
• Output: View result locally and check transformations
Running Locally on Test Data
Challenges
• Small data may not show scale issues
• Local environment may lack some distributed
configs
• Not suitable for performance testing
Running on a Cluster
Running on a cluster enables parallel processing
of large-scale data across multiple machines.
Why Run on a Cluster?
- Handle massive datasets
- Distributed computing power
- Fault tolerance
- Scalability and performance
Cluster Components
- Master Node: Job coordination, resource
management
- Worker Nodes: Execute tasks, store data
- Distributed File System: HDFS, Amazon S3
Popular Cluster Frameworks
- Apache Hadoop (MapReduce)
- Apache Spark
- Apache Flink
- Kubernetes for container orchestration
Execution Workflow
1. Submit job to master
2. Job is split into tasks
3. Tasks are assigned to worker nodes
4. Output is collected and written to storage
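For example, a packaged job can be submitted to the cluster from the command line (the jar names, class names, and paths are illustrative; WordCount and HdfsPlusSpark refer to the sketches shown earlier):

% hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output   # MapReduce job on YARN
% spark-submit --master yarn --class HdfsPlusSpark spark-job.jar              # Spark job on YARN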
Monitoring & Logs
- Use tools like YARN Resource Manager UI,
Spark UI
- Logs can be fetched from master or centralized
log systems
Challenges
- Debugging is harder
- Cluster resource contention
- Network latency and node failures
MapReduce Workflow Example
Problem: Count word frequency
1. Map: Emit (word, 1)
2. Shuffle/Sort: Group by word
3. Reduce: Sum values for each word
MapReduce Code Structure
Mapper:
    def map(key, value):
        for word in value.split():
            emit(word, 1)
Reducer:
    def reduce(key, values):
        emit(key, sum(values))
Workflow with Hadoop
- Input data in HDFS
- MapReduce job submitted to YARN
- Data split into blocks
- Mapper and Reducer tasks executed
- Output written back to HDFS
Chaining MapReduce Jobs
- Output of one job becomes input of the next
- Enables multi-stage data processing
- Example: Cleaning → Transforming →
Aggregating
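A minimal driver-level chaining sketch (job names, paths, and stage logic are illustrative; each stage would set its own mapper and reducer classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/user/hadoop/tmp/stage1");   // illustrative intermediate path

        // Stage 1: cleaning job writes to the intermediate directory.
        Job cleaning = Job.getInstance(conf, "cleaning");
        cleaning.setJarByClass(ChainedJobs.class);
        // cleaning.setMapperClass(...); cleaning.setReducerClass(...);  // problem-specific classes
        FileInputFormat.addInputPath(cleaning, new Path("/user/hadoop/raw"));
        FileOutputFormat.setOutputPath(cleaning, intermediate);
        if (!cleaning.waitForCompletion(true)) {
            System.exit(1);                                         // stop the chain if stage 1 fails
        }

        // Stage 2: aggregating job reads what stage 1 produced.
        Job aggregating = Job.getInstance(conf, "aggregating");
        aggregating.setJarByClass(ChainedJobs.class);
        FileInputFormat.addInputPath(aggregating, intermediate);    // output of job 1 becomes input of job 2
        FileOutputFormat.setOutputPath(aggregating, new Path("/user/hadoop/final"));
        System.exit(aggregating.waitForCompletion(true) ? 0 : 1);
    }
}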
Tools and Frameworks
- Hadoop MapReduce
- Apache Pig (high-level abstraction)
- Apache Hive (SQL-like queries using
MapReduce)
- Cascading, Crunch
Challenges
- High latency for iterative tasks
- Complex debugging
- Manual optimization required
