Introduction to Hadoop (part 1)
Vienna Technical University - IT Services (TU.it)
Dr. Giovanna Roda
What is Big Data?
The large amounts of data that are available nowadays cannot be handled with traditional
technologies. "Big Data" has become the catch-all term for massive amounts of data as well as
for frameworks and R&D initiatives aimed at working with it efficiently.
The “three V’s” often used to characterize Big Data are:
▶ Volume (the sheer volume of data)
▶ Velocity (rate of flow of the data and processing speed needs)
▶ Variety (different sources and formats)
The three V‘s of Big Data
Image source: Wikipedia
The three V‘s of Big Data
Additionally, there are some other characteristics to keep in mind when dealing with Big Data and with
data in general:
▶ Veracity (quality or trustworthiness of data)
▶ Value (economic value of the data)
▶ Variability (general variability in any of the data characteristics)
Challenges posed by Big Data
Here are some challenges posed by Big Data:
▶ disk and memory space
▶ processing speed
▶ hardware faults
▶ network capacity and speed
▶ optimization of resources usage
In the rest of this seminar we are going to see how Hadoop tackles them.
Big Data and Hadoop
Apache Hadoop is one of the most widely adopted
frameworks for Big Data processing.
Some facts about Hadoop:
▶ a project of the Apache Software Foundation
▶ open source
▶ facilitates distributed computing
▶ initially released in 2006. The latest version is 3.3.0 (Jul. 2020), the latest stable one 3.2.1 (Sept. 2019)
▶ originally inspired by Google‘s MapReduce and the proprietary GFS (Google File System)
Some Hadoop features explained
▶ fault tolerance: the ability to withstand hardware or network failures
▶ high availability: this refers to the system minimizing downtimes by eliminating single points
of failure
▶ data locality: tasks are run where the data is located, to reduce the cost of moving large
amounts of data around
How does Hadoop address the challenges of Big Data?
▶ performance: allows processing of large amounts of data through distributed computing
▶ data locality: tasks are run where the data is located
▶ cost-effectiveness: it runs on commodity hardware
▶ scalability: new and/or more performant hardware can be added seamlessly
▶ offers fault tolerance and high availability
▶ good abstraction of the underlying hardware and easy to use
▶ provides SQL and other abstraction frameworks like Hive and HBase
The Hadoop core
The core of Hadoop consists of:
▶ Hadoop common, the core libraries
▶ HDFS, the Hadoop Distributed File System
▶ MapReduce
▶ the YARN resource manager (Yet Another Resource Negotiator)
The Hadoop core is written in Java.
Next:
The core of Hadoop consists of:
▶ Hadoop common, the core libraries
▶ HDFS, the Hadoop Distributed File System
▶ MapReduce
▶ the YARN resource manager (Yet Another Resource Negotiator)
The Hadoop core is written in Java.
What is HDFS?
HDFS stands for Hadoop Distributed File System and it’s a filesystem that takes care of
partitioning data across a cluster.
In order to prevent data loss and/or task termination due to hardware failures, HDFS uses
replication, that is, it simply makes multiple copies (usually 3) of the data (*).
The feature of HDFS being able to withstand hardware failure is known as fault tolerance or
resilience.
(*) Starting from version 3, Hadoop provides erasure coding as an alternative to replication.
Replication vs. Erasure Coding
In order to provide protection against failures one introduces:
▶ data redundancy
▶ a method to recover the lost data using the redundant data
Replication is the simplest method for protecting data: it simply makes n copies of the data.
n-fold replication guarantees the availability of the data for at most n-1 failures; with the default
replication factor of 3 the storage overhead is 200% (equivalent to a storage efficiency of 33%).
Erasure coding provides better storage efficiency (up to 71%) but it can be more costly than
replication in terms of performance. See for instance the white paper “Comparing cost and
performance of replication and erasure coding”.
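As a back-of-the-envelope check of these figures, here is a small Python sketch (illustrative only; the Reed-Solomon parameters below are example configurations, not necessarily the ones used on any particular cluster) comparing n-fold replication with RS(k, m) erasure coding:

def replication_stats(n):
    # n-fold replication: n copies, tolerates the loss of n-1 copies
    efficiency = 1 / n              # useful data / total stored data
    overhead = (n - 1) * 100        # extra storage as a percentage of the original data
    return efficiency, overhead

def erasure_coding_stats(k, m):
    # Reed-Solomon RS(k, m): k data blocks + m parity blocks, tolerates m lost blocks
    efficiency = k / (k + m)
    overhead = m / k * 100
    return efficiency, overhead

print("3-fold replication   : eff = {:.0%}, overhead = {:.0f}%".format(*replication_stats(3)))
print("RS(6,3) erasure code : eff = {:.0%}, overhead = {:.0f}%".format(*erasure_coding_stats(6, 3)))
print("RS(10,4) erasure code: eff = {:.0%}, overhead = {:.0f}%".format(*erasure_coding_stats(10, 4)))
# 3-fold replication   : eff = 33%, overhead = 200%
# RS(6,3) erasure code : eff = 67%, overhead = 50%
# RS(10,4) erasure code: eff = 71%, overhead = 40%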
HDFS architecture
A typical Hadoop cluster installation consists of:
▶ a NameNode
responsible for the bookkeeping of the data partitioned across the DataNodes, managing
the whole filesystem metadata, and load balancing
▶ a secondary NameNode
this is a copy of the NameNode, ready to take over in case of failure of the NameNode.
A secondary NameNode is necessary to guarantee high availability (since the NameNode
is a single point of failure)
▶ multiple DataNodes
here is where the data is saved and the computations take place
HDFS architecture: internal data representation
HDFS supports working with very large files.
Internally, files are split into blocks. One reason for this is that all blocks have the same size,
which makes them easy to distribute and account for.
The block size in HDFS can be configured at installation time; the default is 128 MB.
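For example, assuming the default values above and a file of about 321 MB (like the wiki321MB sample used later in this seminar), a quick sketch of the arithmetic in Python:

import math

block_size_mb = 128     # default dfs.blocksize
replication = 3         # default dfs.replication
file_size_mb = 321      # e.g. the wiki321MB sample file used later

num_blocks = math.ceil(file_size_mb / block_size_mb)    # 3 blocks: 128 + 128 + 65 MB
raw_storage_mb = file_size_mb * replication             # 963 MB of raw storage across the DataNodes

print(num_blocks, raw_storage_mb)

Note that the last block only stores the remaining 65 MB; it is not padded to the full block size.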
What is HDFS?
HDFS architecture
Some notes:
▶ one host can act both as NameNode and DataNode. These are just services running on the
nodes
▶ a minimal HDFS cluster should have 3 nodes to support the default replication of 3 (one of
the nodes acts as both NameNode and DataNode)
▶ rebalancing of data nodes is not done automatically but it can be triggered with:
sudo -u hdfs hdfs balancer -threshold 5
(here we balance data on the DataNodes so that load differs by at most 5%, default is
10%)
Look at the current cluster
We‘re now going to look at the current cluster configuration.
In order to do that we must first:
▶ login to the system
▶ activate the appropriate Hadoop environment
▶ (optional) check environment variables
Login to the system and activate environment
▶ login to VSCFS and open a terminal. Instructions are in: GUI interface with NoMachine client:
https://wiki.hpc.fs.uni-lj.si/index.php?title=Dostop#Dostop_preko_grafi.C4.8Dnega_vmesnika,
SSH: https://wiki.hpc.fs.uni-lj.si/index.php?title=Dostop#Dostop_preko_SSH
▶ activate Hadoop module
module avail Hadoop # show available Hadoop installations (case-sensitive)
module load Hadoop # this will load the latest Hadoop (2.10)
module list
You should have the following 3 modules activated now:
GCCcore/8.3.0, Java/1.8.0_202, Hadoop/2.10.0-GCCcore-8.3.0-native
Check environment variables
These variables usually need to be defined before starting to work with Hadoop:
JAVA_HOME, PATH, HADOOP_CLASSPATH, HADOOP_HOME
Check with:
echo $JAVA_HOME
echo $PATH
echo $HADOOP_CLASSPATH
echo $HADOOP_HOME
Let‘s look at the current Hadoop configuration
▶ what are the NameNode(s)?
hdfs getconf -namenodes
▶ list all DataNodes
yarn node -list -all
▶ block size
hdfs getconf -confKey dfs.blocksize|numfmt --to=iec
▶ replication factor
hdfs getconf -confKey dfs.replication
Let‘s look at the current Hadoop configuration
▶ how much disk space is available on the whole cluster?
hdfs dfsadmin -report    # reports configured capacity, used and remaining space
▶ list all DataNodes
yarn node -list -all
yarn node -list -showDetails # show more details for each host
▶ block size
hdfs getconf -confKey dfs.blocksize|numfmt --to=iec
▶ replication factor
hdfs getconf -confKey dfs.replication
Basic HDFS filesystem commands
You can regard HDFS as a regular file system; in fact, many HDFS shell commands are
inherited from the corresponding bash commands. Here are three basic commands that are
specific to HDFS.
command                               description
hadoop fs -put   /  hdfs dfs -put     copy single src, or multiple srcs, from the local file system to the destination file system
hadoop fs -get   /  hdfs dfs -get     copy files to the local file system
hadoop fs -usage /  hdfs dfs -usage   get help on hadoop fs commands
Basic HDFS filesystem commands
Notes
1. You can use hadoop fs and hdfs dfs interchangeably when working on an HDFS file system.
The command hadoop fs is more generic because it can be used not only on HDFS but also
on other file systems that Hadoop supports (such as Local FS, WebHDFS, S3 FS, and
others).
2. The full list of Hadoop filesystem shell commands can be found here:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html
Basic HDFS filesystem commands that also exist in bash
bash HDFS description
mkdir
hadoop fs -mkdir
hdfs dfs -mkdir
create a directory
ls
hadoop fs -ls
hdfs dfs -ls
list files
cp
hadoop fs -cp
hdfs dfs -cp
copy files
mv
hadoop fs -mv
hdfs dfs -mv
move files
cat
hadoop fs -cat
hdfs dfs -cat
concatenate files and
print them to standard
output
rm
hadoop fs -rm
hdfs dfs -rm
remove files
Exercise: basic HDFS commands
1. List your HDFS home with: hdfs dfs -ls
2. create a directory small_data in your HDFS share (use relative path, so the directory will
be created in your HDFS home)
3. Upload a file from the local filesystem to small_data. If you don’t have a file, use
/home/campus00/public/data/fruits.txt
4. list the contents of small_data on HDFS to check that the file is there
5. compare the space required to save your file on the local filesystem and on HDFS, using du locally and hdfs dfs -du on HDFS
6. clean up (use hdfs dfs -rm -r small_data)
Next:
The core of Hadoop consists of:
▶ Hadoop common, the core libraries
▶ HDFS, the Hadoop Distributed File System
▶ MapReduce
▶ the YARN resource manager (Yet Another Resource Negotiator)
The Hadoop core is written in Java.
MapReduce: the origins
The seminal article on MapReduce is: “MapReduce: Simplified Data Processing on Large
Clusters” by Jeffrey Dean and Sanjay Ghemawat from 2004.
In this paper, the authors (members of the Google research team) describe the methods used
to split, process, and aggregate the large amounts of data underlying the Google search
engine.
MapReduce explained
Source: Stackoverflow
MapReduce explained
The phases of a MapReduce job
1. split: data is partitioned across several computer nodes
2. map: apply a map function to each chunk of data
3. sort & shuffle: the output of the mappers is sorted and distributed to the reducers
4. reduce: finally, a reduce function is applied to the data and an output is produced
Notes
▶ Note 1: the same map (and reduce) function is applied to all the chunks in the data.
▶ Note 2: the map and reduce computations can be carried out in parallel because they’re
completely independent from one another.
▶ Note 3: the split is not the same as the internal partitioning into blocks
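To make this data flow concrete, here is a minimal, purely local Python sketch of the four phases for a word count (no Hadoop involved; it only mimics the logic):

from collections import defaultdict
from itertools import chain

data = ["red apple green apple", "green pear", "red apple"]      # toy input

# 1. split: pretend each line is a chunk stored on a different node
chunks = [line.split() for line in data]

# 2. map: each chunk is mapped independently to (word, 1) pairs
mapped = [[(word, 1) for word in chunk] for chunk in chunks]

# 3. sort & shuffle: group all values belonging to the same key
groups = defaultdict(list)
for key, value in chain.from_iterable(mapped):
    groups[key].append(value)

# 4. reduce: aggregate each group independently
counts = {word: sum(values) for word, values in sorted(groups.items())}
print(counts)    # {'apple': 3, 'green': 2, 'pear': 1, 'red': 2}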
MapReduce explained
Look at some MapReduce configuration values
▶ default number of map tasks
hdfs getconf -confKey mapreduce.job.maps
▶ default number of reduce tasks
hdfs getconf -confKey mapreduce.job.reduces
▶ the total amount of memory buffer (in MB) to use when sorting files
hdfs getconf -confKey mapreduce.task.io.sort.mb
A simple example
We‘re now going to run a simple example on the cluster using MapReduce and Hadoop‘s
streaming library.
Check/set environment variables
These variables usually need to be defined before starting to work with Hadoop:
JAVA_HOME, PATH, HADOOP_CLASSPATH
echo $JAVA_HOME
echo $PATH
echo $HADOOP_CLASSPATH
Hadoop streaming
The MapReduce streaming library allows you to use any executable as mapper or reducer.
The requirement on the mapper and reducer executables is that they are able to:
▶ read their input from stdin (line by line) and
▶ emit their output to stdout.
Here’s the documentation for streaming:
https://hadoop.apache.org/docs/r2.10.0/hadoop-streaming/HadoopStreaming.html
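Any program that follows this contract will do. For instance, a hypothetical identity_mapper.py (the name is ours, not part of the course material) that simply echoes its input, equivalent to using /usr/bin/cat as a mapper as done a few slides further on, could look like this:

#!/bin/python3
# identity_mapper.py -- a do-nothing streaming mapper (illustrative sketch):
# read lines from stdin and emit them unchanged to stdout,
# which is exactly the contract Hadoop streaming expects.
import sys

for line in sys.stdin:
    sys.stdout.write(line)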
Hadoop streaming
To start a MapReduce streaming job use:
hadoop jar hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper mapper_executable \
  -reducer reducer_executable
Note 1: input files and output files are located by default on HDFS
Note 2: in Hadoop 3 you can use the shorter command mapred streaming -input … in place of
hadoop jar … to launch a MapReduce streaming job.
Find the streaming library path
Use one of the following commands to find the location of the streaming library:
which hadoop
echo $HADOOP_HOME
alternatives --display hadoop
find /opt/pkg/software/Hadoop/2.10.0-GCCcore-8.3.0-native/ -name 'hadoop-streaming*.jar'
export STREAMING_PATH=/opt/pkg/software/Hadoop/2.10.0-GCCcore-8.3.0-native/share/hadoop/tools/lib
Run a simple MapReduce job
We are going to use two bash commands as executables:
export STREAMING_PATH=/opt/pkg/software/Hadoop/2.10.0-GCCcore-8.3.0-native/share/hadoop/tools/lib

hadoop jar ${STREAMING_PATH}/hadoop-streaming-2.10.0.jar \
  -input data/wiki321MB \
  -output simplest-out \
  -mapper /usr/bin/cat \
  -reducer /usr/bin/wc
Run a simple MapReduce job – look at the output
What did we just do?
▶ the mapper just outputs the input as it is
▶ the reducer (wc) counts the lines, words, and bytes in the input text
Where to find the output?
It is in simplest-out
The output folder is on HDFS, so we list it with
hdfs dfs -ls simplest-out
If the output folder contains a file named _SUCCESS, then the job was successful. The actual output is in
the file(s) part-*. Look at the output with hdfs dfs -cat simplest-out/part-* and compare
with the output of wc on the local file.
Run a simple MapReduce job – look at the logging messages
Let‘s look at some of the logging messages of MapReduce:
▶ how many mappers were executed?
MapReduce automatically sets the number of map tasks according to the size of the
input (in blocks). The minimum number of map tasks is determined by
mapreduce.job.maps
▶ how many reducers?
MapReduce: word count
Wordcount is the classic MapReduce application. We are going to implement it in Python
and learn some more about MapReduce along the way.
We need to create two scripts:
▶ mapper.py
▶ reducer.py
MapReduce: word count – the mapper
The mapper needs to be able to read from input and emit a series of <key, value> pairs.
By default, the format of each mapper’s output line is a tab-separated string where anything
preceding the tab is regarded as the key.
#!/bin/python3
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("{}\t{}".format(word, 1))
MapReduce: word count – test the mapper
Test the mapper on a local file, for instance wiki321MB
FILE=/home/campus00/hadoop/data/wiki321MB
chmod 755 mapper.py
head -2 $FILE | ./mapper.py | less
If everything went well, you should get an output that looks like this:
8753 1
Dharma 1
. . .
MapReduce: word count –the reducer
#!/bin/python3
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("{}\t{}".format(current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print("{}\t{}".format(current_word, current_count))
MapReduce: word count – test the reducer
Test the reducer on a local file, for instance wiki321MB (as an alternative use
fruits.txt or any other text file).
FILE=/home/campus00/hadoop/data/wiki321MB
chmod 755 reducer.py
head -2 $FILE | ./mapper.py | sort | ./reducer.py | less
If everything went well, you should get an output that looks like this:
"Brains, 1
"Chapter 1
. . .
MapReduce: word count – sort the output
The reducer emits output sorted by key (in this case, the keys are words).
To view the output sorted by count use:
head -2 $FILE | ./mapper.py | sort | ./reducer.py | sort -k2nr | head
Note: sort -k2nr sorts numerically by the second field in reverse order.
What is the most frequent word?
MapReduce: word count – Hadoop it up
Now that we’ve tested our mapper and reducer we’re ready to run the job on the cluster.
We need to tell Hadoop to upload our mapper and reducer code to the DataNodes by using
the option -file (the Hadoop streaming documentation lists all available options).
We just need two more steps:
▶ pick a folder wordcount_out on HDFS where the output will be written (do not create it beforehand: the job creates the output folder itself and fails if it already exists)
▶ upload the data to HDFS
echo $FILE # /home/campus00/hadoop/data/wiki321MB
hdfs dfs -put $FILE
MapReduce: word count – Hadoop it up
Finally, start the MapReduce job.
hadoop jar ${STREAMING_PATH}/hadoop-streaming-2.10.0.jar \
  -file mapper.py \
  -file reducer.py \
  -input wiki321MB \
  -output wordcount_out \
  -mapper mapper.py \
  -reducer reducer.py
MapReduce: word count – check the output
▶ does the output folder contain a file named _SUCCESS?
▶ check the output with
hdfs dfs -cat wordcount_out/part-00000 |head
Of course we would like to have the output sorted by value (the word frequency).
Let us just execute another MapReduce job acting on our output file.
MapReduce: word count – second transformation
The second transformation will use a mapper that swaps key and value and no reducer.
Let us also filter for words with frequency >100.
The mapper is a shell script; call it swap_keyval.sh.
#!/bin/bash
while read key val
do
    if (( val > 100 )); then
        printf "%s\t%s\n" "$val" "$key"
    fi
done
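Since streaming accepts any executable, the same mapper could just as well be written in Python; a hypothetical swap_keyval.py equivalent to the shell script above might look like this:

#!/bin/python3
# swap_keyval.py -- hypothetical Python equivalent of swap_keyval.sh:
# swap key and value and keep only words with a frequency above 100
import sys

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if count > 100:
        print("{}\t{}".format(count, word))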
MapReduce: word count – second transformation - run
Run MapReduce job
hadoop jar ${STREAMING_PATH}/hadoop-streaming-2.10.0.jar \
  -file swap_keyval.sh \
  -mapper swap_keyval.sh \
  -input wordcount_out \
  -output wordcount_out2
Check the output. Does it look right?
MapReduce: configure sort with KeyFieldBasedComparator
Map output by default is sorted in ascending order by key.
We can control how the map output is sorted by configuring the job to use the special
comparator class KeyFieldBasedComparator.
This class has some options similar to the Unix sort (-n to sort numerically, -r for
reverse sorting, -k pos1[,pos2] for specifying fields to sort by).
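As a quick sanity check, the ordering that -nr should produce on the swapped (count, word) pairs can be previewed locally with plain Python (toy data, not the real job output):

pairs = [("120", "pear"), ("4500", "the"), ("310", "apple")]     # toy (count, word) mapper output
for count, word in sorted(pairs, key=lambda kv: int(kv[0]), reverse=True):
    print("{}\t{}".format(count, word))
# 4500    the
# 310     apple
# 120     pear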
MapReduce: configure sort with KeyFieldBasedComparator
COMPARATOR=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

hadoop jar ${STREAMING_PATH}/hadoop-streaming-2.10.0.jar \
  -D mapreduce.job.output.key.comparator.class=$COMPARATOR \
  -D mapreduce.partition.keycomparator.options=-nr \
  -file swap_keyval.sh \
  -mapper swap_keyval.sh \
  -input wordcount_out \
  -output wordcount_out3
MapReduce exercise: change number of mappers or reducers
Try to change the number of map and/or reduce tasks using the following options
(here we use for instance 10 map tasks and 2 reduce tasks):
-D mapred.map.tasks=10
-D mapred.reduce.tasks=2
Does this improve the performance of your MapReduce job? What does the job
output look like?
MapReduce patterns
What can MapReduce be used for? Data summarization by grouping, filtering, joining, etc. We
have already seen the filtering pattern in our second transformation for wordcount.
For more patterns see: “MapReduce Design Patterns”, O’Reilly.
Running Hadoop on your pc
Hadoop can run on a single node (of course, in this case no replication is possible); see
“Hadoop: Setting up a Single Node Cluster”. In this case, no YARN is needed.
By default, Hadoop is configured to run in non-distributed mode, as a single Java process. In a
single-node setup it is also possible to enable pseudo-distributed operation, where each Hadoop
daemon runs as a separate Java process.
It is also possible to use MapReduce without HDFS, using the local filesystem.
Why learn about HDFS and MapReduce?
HDFS and MapReduce are part of most Big Data curricula even though one ultimately will
probably use higher level frameworks for working with Big Data.
One of the main limitations of MapReduce is its disk I/O. The successor of MapReduce, Apache
Spark, offers performance better by orders of magnitude thanks to its in-memory processing.
The Spark engine does not need HDFS.
A higher-level framework like Hive allows you to access data using HQL (Hive Query Language), a
language similar to SQL.
Still, learning HDFS & MapReduce is useful because they exemplify in a simple way the
foundational issues of the Hadoop approach to distributed computing.
Recap
▶ what is Big Data?
▶ what is Apache Hadoop?
▶ some Hadoop features explained
▶ Architecture of a Hadoop cluster
▶ HDFS, the Hadoop Distributed File System
▶ the MapReduce framework
▶ MapReduce streaming
▶ the simplest MapReduce job
▶ MapReduce wordcount
▶ using the Hadoop comparator class
▶ MapReduce design patterns
▶ why learn HDFS and MapReduce?
▶ can I run Hadoop on my pc?
THANK YOU FOR YOUR ATTENTION
www.prace-ri.eu