Data Engineering and Big Data Masters Program
Agenda
• Quiz
• HDFS
• MapReduce
• YARN
• Hands-On: HDFS and MapReduce
Quiz
Why is big data technology gaining so much attention?
1. To manage high volumes of data in a cost-effective manner
2. To unify different varieties of data spread across heterogeneous systems
3. To capture data from fast-occurring events
4. To analyze high volumes and wide varieties of data to generate valuable insights
Quiz
Why is big data technology gaining so much attention?
1. To manage high volumes of data in a cost-effective manner
2. To unify different varieties of data spread across heterogeneous systems
3. To capture data from fast-occurring events
4. To analyze high volumes and wide varieties of data to generate valuable insights
Ans: All of the above
Quiz (Contd.)
Which of the following is not a challenge associated with Big Data?
1. High Volume
2. High Velocity
3. Wide Variety
4. Viscosity of data
Quiz (Contd.)
Which of the following is not a challenge associated with Big Data?
1. High Volume
2. High Velocity
3. Wide Variety
4. Viscosity of data
Ans: 4. Viscosity of data
Quiz
What are the challenges of scaling up?
1. Complexity
2. High cost
3. Lower reliability
4. Less computational power
Quiz
What are the challenges of scaling up?
1. Complexity
2. High cost
3. Lower reliability
4. Less computational power
Ans: 1. Complexity, 2. High cost, and 3. Lower reliability
Quiz (Contd.)
What are the challenges of scaling out?
1. Low storage capacity
2. Coordination between networked machines
3. Handling failures of machines
4. Poor performance
Quiz (Contd.)
What are the challenges of scaling out?
1. Low storage capacity
2. Coordination between networked machines
3. Handling failures of machines
4. Poor performance
Ans: 2. Coordination between networked machines and 3. Handling failures of machines
HDFS - Hadoop Distributed File System
• The file store in HDFS provides scalable, fault-tolerant storage at low cost.
• The HDFS software detects and compensates for hardware issues, including disk problems and server failures.
• HDFS stores files across a collection of servers in a cluster.
• Files are decomposed into blocks, and each block is written to more than one server.
• This replication provides both fault tolerance and performance.
• HDFS is a filesystem written in Java.
• It sits on top of a native filesystem such as ext3, ext4, or xfs.
• It provides redundant storage for massive amounts of data using readily available, industry-standard compute hardware.
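Since HDFS presents itself as one filesystem layered over the nodes' native filesystems, the usual first contact is through its shell. A minimal sketch (paths and output will depend on your cluster):

# Report configured capacity, used, and available space, aggregated across DataNodes
hdfs dfs -df -h /

# Browse the namespace like an ordinary filesystem
hdfs dfs -ls /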
Design of HDFS
HDFS has been designed with the following considerations in mind:
• Very large files: files that are megabytes, gigabytes, terabytes, or petabytes in size.
• Data access: HDFS is built around the idea that data is written once but read many times. A dataset is copied from a source, and analyses are then run on that dataset over time.
• Commodity hardware: Hadoop does not require expensive, highly reliable hardware, as it is designed to run on clusters of commodity machines.
• Growth of storage vs. read/write performance: a hard drive in 1990 could store 1,370 MB and had a transfer speed of 4.4 MB/s, so the full drive could be read in about five minutes (1,370 MB ÷ 4.4 MB/s ≈ 310 s). A 1 TB drive today transfers around 100 MB/s, so reading all the data off the disk takes more than two and a half hours (1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 h).
• Although the storage capacities of hard drives have increased, access speeds have not kept up at comparable cost.
HDFS Blocks
A hard disk consists of concentric circles that form tracks.
• One file can span many blocks. In a local filesystem, blocks are typically around 512 bytes and not necessarily contiguous.
• HDFS is designed for large files, so its block size is 128 MB by default. Moreover, it lays out the underlying local-filesystem blocks contiguously to minimize head seek time.
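Block size and placement can be inspected from the command line. A minimal sketch, where /data/sample.txt is a hypothetical file assumed to already exist in HDFS:

# Show the configured default block size, in bytes
hdfs getconf -confKey dfs.blocksize

# List the blocks of one file, their sizes, and the DataNodes holding each replica
hdfs fsck /data/sample.txt -files -blocks -locations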
Components of Hadoop 1.x
NameNode
• Holds the Hadoop filesystem tree and other metadata about files and directories.
• Keeps an in-memory mapping of which blocks are stored on which DataNode.
Secondary NameNode
• Performs housekeeping activities for the NameNode, such as the periodic merging of the namespace image and edits.
• It is not a backup for the NameNode.
DataNode
• Stores the actual data blocks of HDFS files on its own local disk.
• Sends a signal (called a Heartbeat) to the NameNode periodically to confirm that it is alive.
• Sends a block report to the NameNode at cluster startup as well as periodically, at every 10th Heartbeat.
• DataNodes are the workhorses of the system: they perform all block operations, including periodic checksums, and receive instructions from the NameNode on where and how to store blocks.
Edge Node (not mandatory)
• Hosts the client libraries used to run code/big data applications; it is kept separate to minimize load on the NameNode and DataNodes.
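On a running single-node cluster you can confirm these daemons are up with the JDK's jps tool. A minimal sketch; the output below is illustrative and shows only the HDFS daemons:

# List running Java processes by name
jps
# 2817 NameNode
# 2991 SecondaryNameNode
# 3123 DataNode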
HDFS Commands
Get help on any command with hdfs dfs -help.
Read commands (demo):
● cat
● checksum
● ls
● text
● copyToLocal
● get
Write commands (demo):
● appendToFile
● copyFromLocal
● put
● moveFromLocal
● cp
● mkdir
● mv
● rm
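A minimal end-to-end sketch of these commands, assuming a local file notes.txt and a hypothetical HDFS directory /user/demo:

# Write path: create a directory and upload a local file
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put notes.txt /user/demo/

# Read path: list, print, verify, and download the file
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/notes.txt
hdfs dfs -checksum /user/demo/notes.txt
hdfs dfs -get /user/demo/notes.txt ./notes_copy.txt

# Clean up
hdfs dfs -rm /user/demo/notes.txt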
HDFS Architecture
Daemons of Hadoop 1.x
The NameNode keeps two important files on its hard disk:
1. fsimage (file system image), which contains:
• the entire directory structure of HDFS
• the replication level of each file
• modification and access times of files
• access permissions of files and directories
• the block size of files
• the blocks constituting each file
• a transaction log recording file creations, file deletions, etc.
2. edits
• When any write operation takes place in HDFS, the directory structure is modified.
• These modifications are stored in memory as well as in edits files (which are kept on the hard disk).
• Merging the existing fsimage file with the edits produces an updated fsimage file.
• This process is called checkpointing and is carried out by the Secondary NameNode.
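Both files can be inspected offline with Hadoop's image and edits viewers. A minimal sketch; the file names (as they would appear in the NameNode's storage directory) are illustrative:

# Dump a checkpointed fsimage to readable XML
hdfs oiv -i fsimage_0000000000000000042 -o fsimage.xml -p XML

# Dump an edits segment to XML
hdfs oev -i edits_0000000000000000043-0000000000000000050 -o edits.xml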
HDFS
Safe Mode:
• During startup, the NameNode loads the file system state from the fsimage and the edits log file.
• It then waits for DataNodes to report their blocks. During this time, the NameNode stays in Safemode.
• Safemode is essentially a read-only mode for the HDFS cluster: the NameNode does not allow any modifications to the file system or blocks.
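Safemode can be checked and toggled from the command line with dfsadmin. A minimal sketch:

# Check whether the NameNode is currently in safemode
hdfs dfsadmin -safemode get

# Enter or leave it manually, e.g. for maintenance
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave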
Replica Placement
How does the NameNode choose which DataNodes to store replicas on?
• Placing all replicas on a single node incurs the lowest write bandwidth penalty (since the replication pipeline runs on a single node).
• But this offers no real redundancy: if the node fails, the data for that block is lost.
• Also, the read bandwidth cost is high for off-rack reads.
• At the other extreme, placing replicas in different data centres may maximize redundancy, but at the cost of write bandwidth.
• Hadoop's default strategy is to place the first replica on the same node as the client.
• For clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy.
• The second replica is placed on a different rack from the first (off-rack), chosen at random.
• The third replica is placed on the same rack as the second, but on a different node chosen at random.
• Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.
Note: Cluster = NameNode + Secondary NameNode + DataNodes (+ Edge Node, if present).
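The replication factor is a per-file setting that can be inspected and changed from the shell. A minimal sketch, with /user/demo/notes.txt as a hypothetical file:

# Show the current replication factor of a file
hdfs dfs -stat "replication: %r" /user/demo/notes.txt

# Change it to 2 and wait until re-replication completes
hdfs dfs -setrep -w 2 /user/demo/notes.txt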
Benefits of Replica Placement and Rack Awareness
This strategy gives a good balance among:
• reliability (blocks are stored on two racks, so data remains available even in case of node or rack failure)
• write bandwidth (writes only have to traverse a single network switch)
• read performance (there is a choice of two racks to read from)
• block distribution across the cluster (clients only write a single block on the local rack)
Balancer:
A tool that analyzes block placement and rebalances data across the DataNodes.
Goal: disk usage should be similar across DataNodes.
• Usually run when new DataNodes are added
• The cluster stays online while the balancer is active
• The balancer is throttled to avoid network congestion
• It is a command-line tool
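A minimal sketch of running the balancer; the threshold (maximum allowed deviation from the cluster's average disk utilization, in percent) and the bandwidth cap are illustrative values:

# Rebalance until every DataNode is within 5% of the average utilization
hdfs balancer -threshold 5

# Optionally cap the bandwidth the balancer may use, in bytes per second (~10 MB/s here)
hdfs dfsadmin -setBalancerBandwidth 10485760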
Hadoop Installation Guide
• Windows: https://medium.com/analytics-vidhya/hadoop-setting-up-a-single-node-cluster-in-windows-4221aab69aa6
• Linux: https://www.geeksforgeeks.org/how-to-install-hadoop-in-linux/
• Mac: https://towardsdatascience.com/installing-hadoop-on-a-mac-ec01c67b003c