International Journal of Modern Research in Engineering and Technology (IJMRET)
www.ijmret.org, Volume 3, Issue 6, June 2018. ISSN: 2456-5628
Hadoop Cluster Analysis and Assessment
Saliha KESKİN¹, Atilla ERGÜZEN²
¹(Department of Computer Engineering, Kırıkkale University, Turkey)
ABSTRACT: Large amounts of data are produced daily in fields such as science, economics, engineering and health. A central challenge of pervasive computing is to store and analyze these large amounts of data. This has led to the need for usable, scalable data applications and storage clusters. In this article, we examine the Hadoop architecture, developed to deal with these problems. The Hadoop architecture consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model, which together enable storage and computation on a set of commodity computers. In this study, a Hadoop cluster consisting of four nodes was created. Pi and Grep MapReduce applications were run to show the effect of different data sizes and different numbers of nodes in the cluster, and their results are examined.
KEYWORDS - Big Data, Hadoop, MapReduce, HDFS
I. INTRODUCTION
We are in a Big Data era in which large amounts of data, on the order of terabytes or petabytes, are collected and processed across various sectors. Big Data refers to data that cannot be processed or analyzed using conventional methods.
Such data, which cannot be handled by conventional data processing techniques or stored in conventional databases, lead institutions and organizations toward technologies that provide fast and efficient data processing models. Nowadays, Big Data has become a necessity in many areas.
In Section 2, the Hadoop framework and some of its components are examined. In Section 3, information about the Hadoop cluster setup is given, and the applications and results of the cluster tests are presented. Finally, the conclusions of the paper are presented in Section 4.
II. MATERIALS AND METHODS
With the development of technology, hardware prices are falling, but storage costs are increasing as the volume and variety of data grow. The amount of data generated each day can reach exabytes or even zettabytes. Data are obtained from many different sectors such as aeronautics, meteorology, IoT applications, health, distance education and energy [1–2]. Datasets keep growing, and RDBMSs are not suitable for storing and managing such large datasets [3].
New techniques and technologies are being developed to store and analyze big data. To date, scientists have developed various techniques and technologies to collect, cluster, analyze and visualize big data.
Hadoop is one of the best-known and most powerful Big Data tools. It provides infrastructure and platforms for other specialized Big Data applications. A number of Big Data systems are built on top of Hadoop and are used in many areas, such as data mining and machine learning [4]. Moreover, thanks to Hadoop's multi-node structure, the storage system is more robust and higher performance is achieved with the Hadoop file system [5].
2.1. HADOOP
Hadoop [6] is an open-source software framework that allows large amounts of data to be processed in a distributed computing environment. Rather than relying on expensive equipment and a dependency on separate systems to store and process data, it enables inexpensive parallel processing of Big Data.
Companies need to work with terabytes or petabytes of data to answer specific queries and requests from users. Existing tools are insufficient to handle such big data sets; Hadoop provides a solution to this large-scale data analysis problem.
Hadoop is a convenient tool for addressing the difficulties of big data. It can process data volumes in the petabyte range, scales conveniently, and ensures continuous operation through high fault tolerance. It provides reliability by keeping multiple copies of the data (replication) at different locations (nodes) throughout the cluster.
Hadoop has two core components: HDFS (the Hadoop Distributed File System) and MapReduce. MapReduce is used for data processing, while HDFS is used for data storage [2].
2.1.1 MAPREDUCE
Today, many different hardware architectures support parallel computing. Two popular examples are: (1) multi-core computing, where a multi-core processor contains more than one processing core on a single chip; and (2) distributed computing, where independent computers communicate over a common network to compute in parallel [7].
MapReduce [8] is a programming model used to process large data sets in a cluster by parallelizing and distributing the computation across several processors. It was inspired by the map and reduce functions of functional programming. In a MapReduce job, the data is handled by these two functions (Fig. 1). In the map phase, Hadoop splits the data into small pieces and distributes them to the slave nodes (DataNodes). Because the map tasks run in parallel, the data processing load is balanced. To split a piece further, a DataNode can reapply the map task. Each DataNode handles the small sub-problem it receives and returns a list of (key, value) pairs as a result. In the reduce phase, the outputs of the map tasks are collected and combined to form the result of the overall problem [9].
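As an illustration of this (key, value) flow, a minimal mapper/reducer pair in the Hadoop 2.x Java API is sketched below. The word-count task and all class names are illustrative, not code from this study.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is split into words; a (word, 1) pair is emitted per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (key, value) = (word, 1)
            }
        }
    }
}

// Reduce phase: all counts emitted for the same word are summed into one result.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}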
Figure 1. MapReduce diagram [10]
MapReduce greatly simplifies large-scale analysis tasks on distributed data. A computation written once in MapReduce form can be run, with little extra effort, on thousands of machines in a cluster. This property makes the MapReduce programming model attractive.
The MapReduce programming model [11] uses two kinds of components to control job execution: a central JobTracker and distributed TaskTrackers. When a MapReduce application is submitted to Hadoop, it is delivered to the JobTracker. The JobTracker schedules the job and distributes its tasks to the TaskTrackers. Each TaskTracker performs its assigned tasks and reports progress back to the JobTracker.
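In Hadoop 2.x, which is used in this study, these scheduling roles are filled by YARN's ResourceManager and per-node NodeManagers, but the programming model is unchanged. A driver that submits a job for scheduling might look roughly as follows; this is a sketch reusing the illustrative word-count classes above, with paths taken from command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // handed to the cluster's scheduler
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // blocks until all map and reduce tasks report completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}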
2.1.2. HDFS
The Hadoop Distributed File System (HDFS), which distributes and stores data across a Hadoop cluster, is a file system that provides high-speed access to large amounts of data [7].
HDFS adopts a master/slave architecture with a single NameNode; all other nodes are DataNodes. Stored data is kept separately as metadata and application data: the NameNode stores the metadata, while the DataNodes do the actual storage work. The NameNode holds the metadata of every file stored in HDFS in main memory; this metadata includes the stored file names, the blocks that make up each file, and the DataNodes holding those blocks. Consequently, when a client reads a file, it first contacts the NameNode to obtain the locations of the data blocks that make up the file, and the NameNode directs the client to the DataNodes hosting the requested file. The client then communicates directly with the DataNodes to perform the file operations [12][13].
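This read path is what the standard HDFS client API performs on the caller's behalf; a minimal sketch follows, in which the file path is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode for block locations, then streams data from DataNodes
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}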
Files in HDFS are divided into smaller blocks, typically of 64 MB, and these blocks are replicated and distributed to various DataNodes, as shown in Fig. 2. Although HDFS shares many similarities with existing distributed file systems, it differs from them in two important respects: (1) it has high fault tolerance and is designed to run on low-cost hardware. Each DataNode continuously sends a heartbeat, a signal indicating that it is alive, which is used for tracking storage space and detecting failures; when a hardware or software problem occurs on a DataNode, the NameNode detects it and reassigns the work to another DataNode. (2) It provides efficient access to data and is well suited to applications with big data sets.
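Both the block size and the replication factor are cluster-wide settings in hdfs-site.xml. A typical sketch is shown below; the values match the 64 MB block size mentioned above and the common default replication factor of 3, not necessarily the settings of this study.

<!-- hdfs-site.xml: block size and replication factor (illustrative values) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>   <!-- 64 MB, as described in the text -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- each block is stored on 3 DataNodes -->
  </property>
</configuration>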
Figure 2. HDFS file block structure
III. PERFORMANCE EVALUATIONS
Using Oracle VirtualBox, four virtual machines running the Ubuntu operating system were installed; one serves as the master node and the other three as slave nodes. The slave nodes have 4 GB of memory each, the master node has 8 GB, and every node is configured with a 20 GB SSD.
Each node has the Java Development Kit (JDK 1.8.0) and Hadoop 2.7.2 installed. SSH (Secure Shell) is used to access the other nodes in the cluster from the master node. Once the IP settings and Hadoop configuration are in place, Hadoop MapReduce jobs can be run. In this section, the Pi calculation and Grep applications are used.
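Both programs ship with Hadoop's examples jar, so they can be launched from the master node with commands along the following lines. This is a sketch: the jar path assumes a default Hadoop 2.7.2 layout, and the input/output directories and Grep regular expression are illustrative.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 8 10000
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'v[a-z]*'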
3.1. PI CALCULATION
The area of the circle segment inside the unit square shown in Fig. 3 is π/4. If N points are selected at random within the square, approximately N * π/4 of them fall inside the circle. The program picks random points inside the square and checks whether each point lies inside the circle (it does if x² + y² < R², where x and y are the point's coordinates and R is the radius of the circle). With this quasi-Monte Carlo method, the Pi estimate is computed from the total number of points (N) and the number of those points that fall inside the circle (M) as Pi = 4 * M / N [14].
In this application, each map task generates some random points in the square and counts how many fall inside and outside the circle segment. The reduce task adds up the total number of points and the number of points inside the circle to estimate the value of Pi.
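Seen outside Hadoop, the estimator itself takes only a few lines; below is a minimal plain-Java sketch of the same quasi-Monte Carlo idea, with an arbitrarily chosen sample count.

import java.util.Random;

public class MonteCarloPi {
    public static void main(String[] args) {
        long n = 10_000_000, m = 0;   // N total points, M points inside the circle
        Random rnd = new Random();
        for (long i = 0; i < n; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();  // random point in the unit square
            if (x * x + y * y <= 1.0) m++;   // inside the quarter circle of radius 1
        }
        System.out.println("Pi ≈ " + 4.0 * m / n);   // Pi = 4 * M / N
    }
}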
Figure 3. Monte Carlo Pi Calculation [15]
The first parameter passed to the application is the number of maps, and the second is the number of samples used by each map. Figure 7 shows a run of the Pi application with 8 map tasks and 10,000 samples.
The Pi application was tested with 1, 2 and 4 nodes, respectively; in all tests 8 map tasks and 1000 samples were used. The run times of the Pi application are given in Figure 4. According to the results, as the number of nodes increases, the run time decreases.
Figure 4. Run times (s) of the Pi test for 1, 2 and 4 nodes
We also observed the results of the Pi calculation for different numbers of maps and samples. The run times shown in Fig. 5 indicate that the run time increases both as the number of maps grows and as the number of samples run per map grows.
Figure 5. Run times (s) of the Pi test for different numbers of maps (1 to 10,000) with 10,000,000 and 100,000,000 samples per map
3.2. GREP APPLICATION
The Grep application, used as a search tool, counts the matches of a regular expression in the given input. Each map task outputs the strings matching the pattern; the reduce task collects the counts from the map tasks and produces its output as (match, count) pairs. Figure 9 shows a Grep run that searches for all expressions starting with "v" in the input.
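In outline, the search job's mapper emits a (match, 1) pair for every regex hit. The sketch below is a hedged approximation, not the shipped Hadoop Grep source, and the grep.pattern configuration key is hypothetical.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Search-job mapper: emits (matched string, 1) for every regex match in a line.
public class GrepMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
        // "grep.pattern" is a hypothetical key; the default "v\\w+" matches words starting with "v"
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "v\\w+"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = pattern.matcher(value.toString());
        while (m.find()) {
            context.write(new Text(m.group()), ONE);
        }
    }
}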
The Grep application was tested on file sizes of 100 MB, 500 MB and 1 GB. Each run consists of two stages: first the search for the desired expression is performed over the document, and then the results are sorted. The run times of the Grep application are shown in Figure 6, and the Grep output for a 100 MB file is shown in Figure 10. According to the test results, as the file size increases, the processing time increases.
Figure 6. Run times (s) of the search and sort stages of the Grep tests for file sizes of 100, 500 and 1000 MB
IV. CONCLUSION
Today, the digital world is growing exponentially. Every second, data is produced that cannot be managed or used directly with traditional data management tools. In this article, we reviewed the Hadoop architecture, which offers several interfaces for building tools and applications that consolidate data from many different sources. The Hadoop MapReduce programming model and HDFS are increasingly being used to process and store big data sets.
In this study, a four-node Hadoop cluster was created on a virtualization platform. Performance analysis was carried out on these nodes by running the benchmarking tools available in the Hadoop framework with various parameters. We applied the Grep and Pi estimation algorithms for different numbers of nodes and different file sizes and analyzed the results. Adding more nodes increased the performance of the cluster; as the number of DataNodes grows, Hadoop cluster performance can be improved further. Moreover, as the size of the input data increases, the run time increases.
REFERENCES
[1] M. Ünver, E. Erdal, A. Ergüzen, "Big Data Example in Web Based Learning Management Systems", International Journal of Advanced Computational Engineering and Networking, ISSN(p): 2320-2106, ISSN(e): 2321-2063, Volume 6, Issue 2, Feb. 2018.
[2] A. Ergüzen, E. Erdal, "Medical Image Archiving System Implementation with Lossless Region of Interest and Optical Character Recognition", Journal of Medical Imaging and Health Informatics, Vol. 7, pp. 1246–1252, 2017.
[3] A. Erguzen, E. Erdal, M. Unver, "Big Data Challenges and Opportunities in Distance Education", International Journal of Advanced Computational Engineering and Networking, ISSN(p): 2320-2106, ISSN(e): 2321-2063, Volume 6, Issue 2, Feb. 2018.
[4] P. Chen, C.-Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: A survey on Big Data", Information Sciences, Volume 275, pp. 314-347, 10 August 2014.
[5] A. Ergüzen, E. Erdal, "An Efficient Middle Layer Platform for Medical Imaging Archives", Journal of Healthcare Engineering, 9 May 2018.
[6] Harness the Power of Big Data - The IBM Big Data Platform, USA, 2012, pp. 87-88.
[7] T. Huang, L. Lan, X. Fang, P. An, J. Min, F. Wang, "Promises and Challenges of Big Data Computing in Health Sciences", Big Data Research, Volume 2, Issue 1, March 2015, pp. 2-11.
[8] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Volume 51, Issue 1, January 2008, pp. 107-113.
[9] R. Nanduri, N. Maheshwari, A. Reddyraja, V. Varma, "Job Aware Scheduling Algorithm for MapReduce Framework", Cloud Computing Technology and Science (CloudCom), IEEE Third International Conference, 2011.
[10] J. Kun, https://jeremykun.com/2014/10/05/on-the-computational-complexity-of-mapreduce/ October 5, 2014.
[11] J. Dhok, V. Varma, "Using pattern classification for task assignment in MapReduce", in ISEC, 2010.
[12] S. J. Andaloussi, A. Sekkaki, "Medical Content Based Image Retrieval by Using the Hadoop Framework", Telecommunications (ICT), 2013 20th International Conference, 2013.
[13] M. B. Bisane, F. Pushpanjali, M. Chouragade, "Improving Access Efficiency of Small Files in HDFS", International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February 2016.
[14] J. H. Mathews, "Module for Monte Carlo Pi", http://mathfaculty.fullerton.edu/mathews/n2003/montecarlopimod.html 2005.
[15] M. Völske, S. Syed, "Getting Started with Hadoop", https://www.uni-weimar.de/fileadmin/user/fak/medien/professuren/Webis/teaching/ss18/big-data-seminar/hadoop-tutorial-frame.pdf May 28, 2018.
Figure 7. Pi application with 8 map tasks and 10,000 samples
Figure 8. Output of the Pi estimate
Figure 9. Grep application
Figure 10. Output of the Grep application