Managing Big Data
Module 3 (1st part)
Guided By-
Mangala C.N.
Associate Professor
CSE Dept
EWIT, Bangalore
Presented By –
Soumee Maschatak
1EW18SCS07
Contents
1. Data Format
2. Analysing data with Hadoop
3. Scaling OUT
4. Data Flow
5. Hadoop Streaming
6. Hadoop Pipes
Hadoop Concepts
1. Distribute the data as it is initially stored in the system.
2. Individual nodes can work on data local to those nodes.
3. No data transfer over the network is required for initial processing, so developers do not have to
worry about network programming, temporal dependencies, or shared architecture.
4. Data is replicated multiple times on the system for increased availability and
reliability.
5. The data on the system is split into blocks, typically 64 MB or 128 MB (see the configuration sketch below).
6. Map tasks work on relatively small portions of data.
7. A master program allocates work to the nodes and manages high availability.
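The block size and replication factor mentioned above are cluster-wide HDFS settings. A minimal sketch of how they might be set in hdfs-site.xml is shown below; the values are illustrative only, and the property names are those used in Hadoop 2.x.

```xml
<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>   <!-- 128 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block is stored on three nodes -->
  </property>
</configuration>
```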
Data Format
1. Data is available everywhere and in different sizes and formats.
2. Hadoop can handle many different types of data formats, from flat text files
to databases.
3. Data is captured by various applications such as sensors, mobiles,
satellites, social networks, and users of laptops/desktops.
4. For example – the meteorology department:
5. Data from tens of thousands of meteorology stations is stored in
zip files, one for each month.
Analysing data with Hadoop
Map and Reduce
1. MapReduce processes the data in two phases: the map phase and the reduce phase.
2. Both phases have key-value pairs as input and output, the types of which may be chosen by the
developer.
3. The developer also specifies two functions: the map function and the reduce function.
4. The input to our map phase is the raw NCDC meteorology data.
5. We chose a text input format that gives us each line in the dataset as a text value.
6. The key is the offset of the beginning of the line from the beginning of the file.
7. The map function is simple. It is just a data preparation phase, arranging the
data in such a way that the reduce function can act on it easily.
8. In the case of the meteorology stations, finding the maximum wind speed for each city can be
done using a MapReduce job.
9. The map phase is also a good place to drop unwanted records.
10. The lines of data are fed to the map function as key-value pairs.
11. The keys are the line offsets within the input file.
12. The results from the map function are processed by the MapReduce framework before
being forwarded to the reduce function.
13. This processing sorts and groups the key-value pairs by key (a hypothetical illustration follows this list).
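The flow below illustrates the two phases on a few made-up records, assuming each input line has the form city,windspeed (the offsets, cities, and speeds are purely illustrative):

```
map input (offset, line)       map output (city, speed)
(0,  "bangalore,42")    -map->  ("bangalore", 42)
(14, "chennai,35")      -map->  ("chennai", 35)
(27, "bangalore,57")    -map->  ("bangalore", 57)

sort/group by key:   ("bangalore", [42, 57])   ("chennai", [35])
reduce (maximum):    ("bangalore", 57)         ("chennai", 35)
```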
Hadoop MapReduce
Java MapReduce
1. This step is all about writing the code for a MapReduce job.
2. There are three things needed for a Java MapReduce program:
a. A map function.
b. A reduce function.
c. Some code to run the job.
3. The map function is represented by the Mapper class, which declares a map() method that the
developer overrides.
4. The MapReduce model processes large unstructured data sets with a distributed algorithm on a
Hadoop cluster. (A minimal code sketch follows.)
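A minimal sketch of the three parts is given below, written against the org.apache.hadoop.mapreduce API. The class name MaxWindSpeed and the record layout (one city,windspeed pair per line) are assumptions made for illustration; the real NCDC records are fixed-width and would need different parsing.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxWindSpeed {

  // Map: (line offset, line) -> (city, wind speed)
  public static class MaxWindMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");   // assumed "city,windspeed" layout
      if (fields.length == 2) {                         // drop malformed records in the map phase
        context.write(new Text(fields[0]),
                      new IntWritable(Integer.parseInt(fields[1].trim())));
      }
    }
  }

  // Reduce: (city, [speeds...]) -> (city, maximum speed)
  public static class MaxWindReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new IntWritable(max));
    }
  }

  // Driver: some code to run the job.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max wind speed");
    job.setJarByClass(MaxWindSpeed.class);
    job.setMapperClass(MaxWindMapper.class);
    job.setReducerClass(MaxWindReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // TextInputFormat is the default
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the mapper silently drops malformed lines, matching the earlier point that the map phase is a good place to discard unwanted records.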
Scaling Out
1. Scalability has two parts: UP and OUT.
2. Scale UP means that the system performs better as one adds more hardware to a single node in the
system.
3. Scaling OUT means adding more nodes to a distributed system.
4. When one builds a complex distributed system/application, one faces certain obstacles. The end
result has to scale out, so that more hardware resources can easily be added in the face of higher load.
5. The benefit really starts to show on bigger clusters; a system should scale up well in order to scale out well.
6. In order to scale out, one needs to store the data in a distributed filesystem, typically HDFS (Hadoop
Distributed File System), to allow Hadoop to run the MapReduce computation on each machine
hosting a part of the data (see the command sketch below).
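As an illustration of that last step, the commands below copy local input files into HDFS and run the job there. The paths, file names, and jar name are hypothetical, and MaxWindSpeed refers to the sketch in the Java MapReduce section.

```
hdfs dfs -mkdir -p /user/hadoop/weather/input
hdfs dfs -put stations-*.txt /user/hadoop/weather/input
hadoop jar maxwindspeed.jar MaxWindSpeed \
    /user/hadoop/weather/input /user/hadoop/weather/output
hdfs dfs -cat /user/hadoop/weather/output/part-r-00000
```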
Data Flow
1. A MapReduce job is the combination of the input data, the MapReduce code and the
configuration information.
2. Hadoop runs the job by dividing it into two kinds of tasks: map tasks and reduce tasks.
3. Hadoop has two types of nodes that control the job execution process: a jobtracker
and a number of tasktrackers.
4. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers.
5. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a
record of the overall progress of each job.
6. If a task fails, the jobtracker can reschedule it on a different tasktracker.
7. Hadoop divides the input to a MapReduce job into fixed-size pieces called
input splits.
8. Hadoop creates one map task for each split, which runs the user defined map
function for each record in the split.
9. More splits means the time taken to process each split is short compared to
the time to process the complete input.
10. So if we are processing the splits in parallel, the processing is better load-
balanced if the splits are small.
11. Hadoop does its best to execute a map task on a node where the input data
resides in HDFS. This is called the data locality optimization, since it doesn’t use
valuable cluster bandwidth.
12. Map tasks write their results to the local disk, not to HDFS.
13. Map output data is intermediate output: it is processed by reduce tasks to
produce the final result, and once the job is complete the map output can
be deleted. So storing it in HDFS, with replication, would be overkill.
14. If a map task fails on a node before its output has been
consumed by the reduce task, Hadoop automatically reruns the map task on another node to recreate the output.
15. The output of the reduce task is normally stored in HDFS for reliability.
16. For each HDFS block of the reduce output, the first replica is stored on the
local node, with the other replicas being stored on off-rack nodes. (A small driver fragment related to reduce output is shown below.)
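The number of reduce tasks, and hence the number of reduce output files written to HDFS, is set in the driver. The fragment below is hypothetical and would extend the driver in the Java MapReduce sketch above.

```java
// Hypothetical fragment for the driver shown earlier (the Job object created there).
job.setNumReduceTasks(2);    // two reducers -> two HDFS output files: part-r-00000, part-r-00001
// job.setNumReduceTasks(0); // map-only job: map output is written directly to HDFS
```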
Data Flow
Data Streaming
1. Hadoop Streaming offers an interface/API to MapReduce that allows users to write the map
and reduce jobs in languages other than Java. Programmers can use any language that can read
standard input and write standard output, such as Python or Ruby.
2. Streaming is naturally suited to text processing, since it has a line-oriented view of data.
3. Map input data is passed over standard input to the map function, which processes it line by line
and writes lines to standard output.
4. A map output key-value pair is written as a single tab-separated line.
5. Input to the reduce function is in the same format: tab-separated key-value pairs passed
over standard input.
6. The reduce function reads lines from standard input (which the framework has already sorted by key) and writes its
results to standard output.
7. Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and
run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
8. Streaming supports streaming command options as well as generic command options (an example invocation is shown below, followed by the option table).
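A typical invocation, in the style of the Apache Hadoop streaming documentation, looks like the following; the jar location varies by Hadoop version and installation, and the input/output directory names are placeholders.

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
```

Here /bin/cat simply echoes each input line as the map output, and /usr/bin/wc counts the lines, words, and characters it receives, producing the reduce output.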
Hadoop streaming command options
Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for the mapper
-output directoryname | Required | Output location for the reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class; if not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class; if not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reducer a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass an environment variable to streaming commands
-inputreader | Optional | For backwards compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily; for example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when a map task fails
-reducedebug | Optional | Script to call when a reduce task fails
Hadoop Pipes
1. Hadoop Pipes is the C++ interface to Hadoop MapReduce.
2. Unlike Streaming, which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with the process
running the C++ map or reduce function.
3. The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the
HadoopPipes namespace and providing implementations of the map() and reduce() methods in each
case.
4. Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard
Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater
burden on the application developer, who has to convert to and from richer domain-level types.
Hadoop streaming and Hadoop pipes
Important Questions
1. Write Java code for the Mapper and Reducer considering the weather dataset as an example;
the output must retrieve the maximum temperature for every year.
2. Describe with a neat diagram Map Reduce data flow with a single reduce task.
3. Explain map and reduce phase with an example.
4. Briefly explain the significance of data flow in a distributed file system.
5. What are Hadoop pipes? Explain.
6. Explain different types of data input format and output format supported by Hadoop with
an example.
7. What are Hadoop Pipes? Give a brief explanation with an example.
8. What is the function of a combiner in MapReduce? How does it differ from the reduce
function?