Managing Big Data
Module 3 (1st part)
Guided By-
Mangala C.N.
Associate Professor
CSE Dept
EWIT, Bangalore
Presented By –
Soumee Maschatak
1EW18SCS07
Contents
1. Data Format
2. Analysing data with Hadoop
3. Scaling OUT
4. Data Flow
5. Hadoop Streaming
6. Hadoop Pipes
Hadoop Concepts
1. Distribute the data as it is initially stored in the system.
2. Individual nodes can work on data local to those nodes.
3. No data transfer over the network is required for initial processing, so developers do not have to
worry about network programming, temporal dependencies, or shared architecture.
4. Data is replicated multiple times on the system for increased availability and
reliability.
5. The data on the system is split into blocks, typically 64 MB or 128 MB (see the configuration sketch below).
6. Map tasks work on relatively small portions of data.
7. A master program allocates work to the nodes and manages high availability.
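The block size and replication factor mentioned above are cluster-wide HDFS settings. A minimal sketch of how they might be set in hdfs-site.xml is shown below; the values are illustrative only, and the property names are those used in Hadoop 2.x.

```xml
<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>   <!-- 128 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block is stored on three nodes -->
  </property>
</configuration>
```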
Data Format
1. Data is available everywhere and in different sizes and formats.
2. Hadoop can handle many different types of data formats, from flat text files
to databases.
3. Data is captured by various applications such as sensors, mobiles,
satellites, social networks, and users of laptops/desktops.
4. For example – the meteorology department:
5. Data from tens of thousands of meteorology stations is stored in
zip files, one for each month.
Analysing data with Hadoop
Map and Reduce
1. MapReduce processes the data in two phases: the map phase and the reduce phase.
2. Both phases have key-value pairs as input and output, the types of which may be chosen by the
developer.
3. The developer also specifies two functions: the map function and the reduce function.
4. The input to our map phase is the raw NCDC meteorology data.
5. We chose a text input format that gives us each line in the dataset as a text value.
6. The key is the offset of the beginning of the line from the beginning of the file.
7. The map function is simple. It is just a data preparation phase, arranging the
data in such a way that the reduce function can act on it easily.
8. In the case of the meteorology stations, finding the maximum wind speed for each city can be
done using a MapReduce job.
9. The map phase is also a good place to drop unwanted records.
10. The lines of data are fed to the map function as key-value pairs.
11. The keys are the line offsets within the input file.
12. The results from the map function are processed by the MapReduce framework before
being forwarded to the reduce function.
13. This processing sorts and groups the key-value pairs by key (a hypothetical illustration follows this list).
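The flow below illustrates the two phases on a few made-up records, assuming each input line has the form city,windspeed (the offsets, cities, and speeds are purely illustrative):

```
map input (offset, line)       map output (city, speed)
(0,  "bangalore,42")    -map->  ("bangalore", 42)
(14, "chennai,35")      -map->  ("chennai", 35)
(27, "bangalore,57")    -map->  ("bangalore", 57)

sort/group by key:   ("bangalore", [42, 57])   ("chennai", [35])
reduce (maximum):    ("bangalore", 57)         ("chennai", 35)
```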
Hadoop MapReduce
Java MapReduce
1. This step is all about writing the code for a MapReduce job.
2. There are three things needed for a Java MapReduce program:
a. A map function.
b. A reduce function.
c. Some code to run the job.
3. The map function is represented by the Mapper class, which declares a map() method that the
developer overrides.
4. The MapReduce model processes large unstructured data sets with a distributed algorithm on a
Hadoop cluster. (A minimal code sketch follows.)
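A minimal sketch of the three parts is given below, written against the org.apache.hadoop.mapreduce API. The class name MaxWindSpeed and the record layout (one city,windspeed pair per line) are assumptions made for illustration; the real NCDC records are fixed-width and would need different parsing.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxWindSpeed {

  // Map: (line offset, line) -> (city, wind speed)
  public static class MaxWindMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");   // assumed "city,windspeed" layout
      if (fields.length == 2) {                         // drop malformed records in the map phase
        context.write(new Text(fields[0]),
                      new IntWritable(Integer.parseInt(fields[1].trim())));
      }
    }
  }

  // Reduce: (city, [speeds...]) -> (city, maximum speed)
  public static class MaxWindReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new IntWritable(max));
    }
  }

  // Driver: some code to run the job.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max wind speed");
    job.setJarByClass(MaxWindSpeed.class);
    job.setMapperClass(MaxWindMapper.class);
    job.setReducerClass(MaxWindReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // TextInputFormat is the default
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the mapper silently drops malformed lines, matching the earlier point that the map phase is a good place to discard unwanted records.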
Scaling Out
1. Scalability has two parts: UP and OUT.
2. Scale UP means that the system performs better as one adds more hardware to a single node in the
system.
3. Scaling OUT means adding more nodes to a distributed system.
4. When one builds a complex distributed system/application, one faces certain obstacles. The end
result has to scale out, so that more hardware resources can easily be added in the face of higher load.
5. The benefit really starts to show on bigger clusters; a system should scale up well in order to scale out well.
6. In order to scale out, one needs to store the data in a distributed filesystem, typically HDFS (Hadoop
Distributed File System), to allow Hadoop to run the MapReduce computation on each machine
hosting a part of the data (see the command sketch below).
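As an illustration of that last step, the commands below copy local input files into HDFS and run the job there. The paths, file names, and jar name are hypothetical, and MaxWindSpeed refers to the sketch in the Java MapReduce section.

```
hdfs dfs -mkdir -p /user/hadoop/weather/input
hdfs dfs -put stations-*.txt /user/hadoop/weather/input
hadoop jar maxwindspeed.jar MaxWindSpeed \
    /user/hadoop/weather/input /user/hadoop/weather/output
hdfs dfs -cat /user/hadoop/weather/output/part-r-00000
```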
Data Flow
1. A MapReduce job is the combination of the input data, the MapReduce code and the
configuration information.
2. Hadoop runs the job by dividing it into two kinds of tasks: map tasks and reduce tasks.
3. Hadoop has two types of nodes that control the job execution process: a jobtracker
and a number of tasktrackers.
4. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers.
5. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a
record of the overall progress of each job.
6. If a task fails, the jobtracker can reschedule it on a different tasktracker.
7. Hadoop divides the input to a MapReduce job into fixed-size pieces called
input splits.
8. Hadoop creates one map task for each split, which runs the user defined map
function for each record in the split.
9. More splits means the time taken to process each split is short compared to
the time to process the complete input.
10. So if we are processing the splits in parallel, the processing is better load-
balanced if the splits are small.
11. Hadoop does its best to execute a map task on a node where the input data
resides in HDFS. This is called the data locality optimization, since it doesn’t use
valuable cluster bandwidth.
12. Map tasks write their results to the local disk, not to HDFS.
13. Map output data is intermediate output: it is processed by reduce tasks to
produce the final result, and once the job is complete the map output can
be deleted. So storing it in HDFS, with replication, would be overkill.
14. If a map task fails on a node before its output has been
consumed by the reduce task, Hadoop automatically reruns the map task on another node to recreate the output.
15. The output of the reduce task is normally stored in HDFS for reliability.
16. For each HDFS block of the reduce output, the first replica is stored on the
local node, with the other replicas being stored on off-rack nodes. (A small driver fragment related to reduce output is shown below.)
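The number of reduce tasks, and hence the number of reduce output files written to HDFS, is set in the driver. The fragment below is hypothetical and would extend the driver in the Java MapReduce sketch above.

```java
// Hypothetical fragment for the driver shown earlier (the Job object created there).
job.setNumReduceTasks(2);    // two reducers -> two HDFS output files: part-r-00000, part-r-00001
// job.setNumReduceTasks(0); // map-only job: map output is written directly to HDFS
```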
Data Flow
Data Streaming
1. Hadoop Streaming offers an interface/API to MapReduce that allows users to write the map
and reduce jobs in languages other than Java. Programmers can use any language that can read
standard input and write standard output, such as Python or Ruby.
2. Streaming is naturally suited to text processing, since it has a line-oriented view of data.
3. Map input data is passed over standard input to the map function, which processes it line by line
and writes lines to standard output.
4. A map output key-value pair is written as a single tab-separated line.
5. Input to the reduce function is in the same format: tab-separated key-value pairs passed
over standard input.
6. The reduce function reads lines from standard input (which the framework has already sorted by key) and writes its
results to standard output.
7. Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and
run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
8. Streaming supports streaming command options as well as generic command options (an example invocation is shown below, followed by the option table).
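A typical invocation, in the style of the Apache Hadoop streaming documentation, looks like the following; the jar location varies by Hadoop version and installation, and the input/output directory names are placeholders.

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
```

Here /bin/cat simply echoes each input line as the map output, and /usr/bin/wc counts the lines, words, and characters it receives, producing the reduce output.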
Hadoop streaming command options
Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for the mapper
-output directoryname | Required | Output location for the reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class; if not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class; if not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reducer a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass an environment variable to streaming commands
-inputreader | Optional | For backwards compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily; for example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when a map task fails
-reducedebug | Optional | Script to call when a reduce task fails
Hadoop Pipes
1. Hadoop Pipes is the C++ interface to Hadoop MapReduce.
2. Unlike Streaming, which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with the process
running the C++ map or reduce function.
3. The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the
HadoopPipes namespace and providing implementations of the map() and reduce() methods in each
case.
4. Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard
Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater
burden on the application developer, who has to convert to and from richer domain-level types.
Hadoop streaming and Hadoop pipes
Important Questions
1. Write Java code for the Mapper and Reducer considering the weather dataset as an example;
the output must retrieve the maximum temperature for every year.
2. Describe with a neat diagram Map Reduce data flow with a single reduce task.
3. Explain map and reduce phase with an example.
4. Briefly explain the significance of data flow in a distributed file system.
5. What are Hadoop pipes? Explain.
6. Explain different types of data input format and output format supported by Hadoop with
an example.
7. What are Hadoop Pipes? Give a brief explanation with an example.
8. What is the function of a combiner in MapReduce? How does it differ from the reduce
function?