Data Engineering and Big Data Masters Program
Agenda
• Quiz
• HDFS
• MapReduce
• YARN
• Hands-On: HDFS and MapReduce
Quiz
Why is big data technology gaining so much attention?
1. To manage high volumes of data in a cost-effective manner
2. To unify different varieties of data spread across heterogeneous systems
3. To capture data from fast-occurring events
4. To analyze high volumes and wide varieties of data to generate valuable insights
Quiz
Why is big data technology gaining so much attention?
1. To manage high volumes of data in a cost-effective manner
2. To unify different varieties of data spread across heterogeneous systems
3. To capture data from fast-occurring events
4. To analyze high volumes and wide varieties of data to generate valuable insights
Ans: All of the above
Quiz (Contd.)
Which of the following is not a challenge associated with Big Data?
1. High Volume
2. High Velocity
3. Wide Variety
4. Viscosity of data
Quiz (Contd.)
Which of the following is not a challenge associated with Big Data?
1. High Volume
2. High Velocity
3. Wide Variety
4. Viscosity of data
Ans: 4. Viscosity of data
Quiz
What are the challenges of scaling up?
1. Complexity
2. High cost
3. Lower reliability
4. Less computational power
Quiz
What are the challenges of scaling up?
1. Complexity
2. High cost
3. Lower reliability
4. Less computational power
Ans: 1. Complexity, 2. High cost, and 3. Lower reliability
Quiz (Contd.)
What are the challenges of scaling out?
1. Low storage capacity
2. Coordination between networked machines
3. Handling failures of machines
4. Poor performance
Quiz (Contd.)
What are the challenges of scaling out?
1. Low storage capacity
2. Coordination between networked machines
3. Handling failures of machines
4. Poor performance
Ans: 2. Coordination between networked machines and 3. Handling failures of machines
HDFS - Hadoop Distributed File System
• The file store in HDFS provides scalable, fault-tolerant storage at low cost.
• The HDFS software detects and compensates for hardware issues, including disk problems and server failures.
• HDFS stores files across a collection of servers in a cluster.
• Files are decomposed into blocks, and each block is written to more than one server.
• This replication provides both fault tolerance and performance.
• HDFS is a filesystem written in Java.
• It sits on top of a native filesystem such as ext3, ext4, or xfs.
• It provides redundant storage for massive amounts of data using readily available, industry-standard compute hardware.
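Since HDFS presents itself as one filesystem layered over the nodes' native filesystems, the usual first contact is through its shell. A minimal sketch (paths and output will depend on your cluster):

# Report configured capacity, used, and available space, aggregated across DataNodes
hdfs dfs -df -h /

# Browse the namespace like an ordinary filesystem
hdfs dfs -ls /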
Design of HDFS
HDFS has been designed with the following considerations in mind:
• Very large files: files that are megabytes, gigabytes, terabytes, or petabytes in size.
• Data access: HDFS is built around the idea that data is written once but read many times. A dataset is copied from a source, and analyses are then run on that dataset over time.
• Commodity hardware: Hadoop does not require expensive, highly reliable hardware, as it is designed to run on clusters of commodity machines.
• Growth of storage vs. read/write performance: a hard drive in 1990 could store 1,370 MB and had a transfer speed of 4.4 MB/s, so the full drive could be read in about five minutes (1,370 MB ÷ 4.4 MB/s ≈ 310 s). A 1 TB drive today transfers around 100 MB/s, so reading all the data off the disk takes more than two and a half hours (1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 h).
• Although the storage capacities of hard drives have increased, access speeds have not kept up at comparable cost.
HDFS Blocks
A hard disk consists of concentric circles that form tracks.
• One file can span many blocks. In a local filesystem, blocks are typically around 512 bytes and not necessarily contiguous.
• HDFS is designed for large files, so its block size is 128 MB by default. Moreover, it lays out the underlying local-filesystem blocks contiguously to minimize head seek time.
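Block size and placement can be inspected from the command line. A minimal sketch, where /data/sample.txt is a hypothetical file assumed to already exist in HDFS:

# Show the configured default block size, in bytes
hdfs getconf -confKey dfs.blocksize

# List the blocks of one file, their sizes, and the DataNodes holding each replica
hdfs fsck /data/sample.txt -files -blocks -locations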
Components of Hadoop 1.x
NameNode
• Holds the Hadoop filesystem tree and other metadata about files and directories.
• Keeps an in-memory mapping of which blocks are stored on which DataNode.
Secondary NameNode
• Performs housekeeping activities for the NameNode, such as the periodic merging of the namespace image and edits.
• It is not a backup for the NameNode.
DataNode
• Stores the actual data blocks of HDFS files on its own local disk.
• Sends a signal (called a Heartbeat) to the NameNode periodically to confirm that it is alive.
• Sends a block report to the NameNode at cluster startup as well as periodically, at every 10th Heartbeat.
• DataNodes are the workhorses of the system: they perform all block operations, including periodic checksums, and receive instructions from the NameNode on where and how to store blocks.
Edge Node (not mandatory)
• Hosts the client libraries used to run code/big data applications; it is kept separate to minimize load on the NameNode and DataNodes.
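On a running single-node cluster you can confirm these daemons are up with the JDK's jps tool. A minimal sketch; the output below is illustrative and shows only the HDFS daemons:

# List running Java processes by name
jps
# 2817 NameNode
# 2991 SecondaryNameNode
# 3123 DataNode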
HDFS Commands
Get help on any command with hdfs dfs -help.
Read commands (demo):
● cat
● checksum
● ls
● text
● copyToLocal
● get
Write commands (demo):
● appendToFile
● copyFromLocal
● put
● moveFromLocal
● cp
● mkdir
● mv
● rm
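A minimal end-to-end sketch of these commands, assuming a local file notes.txt and a hypothetical HDFS directory /user/demo:

# Write path: create a directory and upload a local file
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put notes.txt /user/demo/

# Read path: list, print, verify, and download the file
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/notes.txt
hdfs dfs -checksum /user/demo/notes.txt
hdfs dfs -get /user/demo/notes.txt ./notes_copy.txt

# Clean up
hdfs dfs -rm /user/demo/notes.txt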
HDFS Architecture
Daemons of Hadoop 1.x
The NameNode keeps two important files on its hard disk:
1. fsimage (file system image), which contains:
• the entire directory structure of HDFS
• the replication level of each file
• modification and access times of files
• access permissions of files and directories
• the block size of files
• the blocks constituting each file
• a transaction log recording file creations, file deletions, etc.
2. edits
• When any write operation takes place in HDFS, the directory structure is modified.
• These modifications are stored in memory as well as in edits files (which are kept on the hard disk).
• Merging the existing fsimage file with the edits produces an updated fsimage file.
• This process is called checkpointing and is carried out by the Secondary NameNode.
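Both files can be inspected offline with Hadoop's image and edits viewers. A minimal sketch; the file names (as they would appear in the NameNode's storage directory) are illustrative:

# Dump a checkpointed fsimage to readable XML
hdfs oiv -i fsimage_0000000000000000042 -o fsimage.xml -p XML

# Dump an edits segment to XML
hdfs oev -i edits_0000000000000000043-0000000000000000050 -o edits.xml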
HDFS
Safe Mode:
• During startup, the NameNode loads the file system state from the fsimage and the edits log file.
• It then waits for DataNodes to report their blocks. During this time, the NameNode stays in Safemode.
• Safemode is essentially a read-only mode for the HDFS cluster: the NameNode does not allow any modifications to the file system or blocks.
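Safemode can be checked and toggled from the command line with dfsadmin. A minimal sketch:

# Check whether the NameNode is currently in safemode
hdfs dfsadmin -safemode get

# Enter or leave it manually, e.g. for maintenance
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave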
Replica Placement
How does the NameNode choose which DataNodes to store replicas on?
• Placing all replicas on a single node incurs the lowest write bandwidth penalty (since the replication pipeline runs on a single node).
• But this offers no real redundancy: if the node fails, the data for that block is lost.
• Also, the read bandwidth cost is high for off-rack reads.
• At the other extreme, placing replicas in different data centres may maximize redundancy, but at the cost of write bandwidth.
• Hadoop's default strategy is to place the first replica on the same node as the client.
• For clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy.
• The second replica is placed on a different rack from the first (off-rack), chosen at random.
• The third replica is placed on the same rack as the second, but on a different node chosen at random.
• Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.
Note: Cluster = NameNode + Secondary NameNode + DataNodes (+ Edge Node, if present).
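The replication factor is a per-file setting that can be inspected and changed from the shell. A minimal sketch, with /user/demo/notes.txt as a hypothetical file:

# Show the current replication factor of a file
hdfs dfs -stat "replication: %r" /user/demo/notes.txt

# Change it to 2 and wait until re-replication completes
hdfs dfs -setrep -w 2 /user/demo/notes.txt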
Benefits of Replica Placement and Rack Awareness
This strategy gives a good balance among:
• reliability (blocks are stored on two racks, so data remains available even in case of node or rack failure)
• write bandwidth (writes only have to traverse a single network switch)
• read performance (there is a choice of two racks to read from)
• block distribution across the cluster (clients only write a single block on the local rack)
Balancer:
A tool that analyzes block placement and rebalances data across the DataNodes.
Goal: disk usage should be similar across DataNodes.
• Usually run when new DataNodes are added
• The cluster stays online while the balancer is active
• The balancer is throttled to avoid network congestion
• It is a command-line tool
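A minimal sketch of running the balancer; the threshold (maximum allowed deviation from the cluster's average disk utilization, in percent) and the bandwidth cap are illustrative values:

# Rebalance until every DataNode is within 5% of the average utilization
hdfs balancer -threshold 5

# Optionally cap the bandwidth the balancer may use, in bytes per second (~10 MB/s here)
hdfs dfsadmin -setBalancerBandwidth 10485760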
Hadoop Installation Guide
• Windows: https://medium.com/analytics-vidhya/hadoop-setting-up-a-single-node-cluster-in-windows-4221aab69aa6
• Linux: https://www.geeksforgeeks.org/how-to-install-hadoop-in-linux/
• Mac: https://towardsdatascience.com/installing-hadoop-on-a-mac-ec01c67b003c