MCS 7106: Advanced Topics in Computer Science
Simon Alex and Nambaale
Hadoop
Simon Alex and Nambaale MCS 7106 October 27, 2019 1 / 29
Overview
1 Hadoop
Hadoop Overview
MapReduce
Future Hadoop
Pros and Cons
Pilot Implementation
What is Hadoop?
Hadoop is “an open source software platform for distributed storage
and distributed processing of very large data sets on computer
clusters built from commodity hardware” (Hortonworks).
The Hadoop software platform addresses the three dimensions (the
3 V’s) of data management challenges: volume, velocity
and variety.
Hadoop Origin
Google published the GFS and MapReduce papers in 2003-2004.
Doug Cutting and Mike Cafarella were building “Nutch”, an open source
web search engine, at the same time.
Hadoop grew out of Nutch in 2006, driven primarily by Doug Cutting, who joined Yahoo! that year.
Why Hadoop?
Disk seek times
Hardware failures
Processing times
World of Hadoop
HDFS
HDFS is based on Google’s GFS
Handles big files
HDFS breaks big files into blocks
Stored across several commodity computers
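The block idea above can be sketched in a few lines of Python. This is only an illustration: the 128 MB block size and replication factor of 3 are the usual HDFS defaults, while the node names and the round-robin placement are assumptions made for the example (real HDFS placement is rack-aware and decided by the NameNode).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual default block size in Hadoop 2 and later
REPLICATION = 3                 # usual default HDFS replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified, hypothetical policy; assumes len(nodes) >= replication.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks), "blocks")
for block_id, replicas in placement.items():
    print(block_id, replicas)
```

A 300 MB file therefore occupies two full blocks plus one partial block, and losing any single machine still leaves two copies of every block.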
HDFS Architecture
HDFS comprises two important components: a NameNode (the
master) and a number of DataNodes (workers).
The NameNode serves all metadata operations on the file system, such as
creating, opening, closing or renaming files and directories.
DataNodes store and retrieve blocks when they are told to (by clients
or the NameNode).
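The division of labour between the master and the workers can be illustrated with a toy in-memory model. The class and method names below are hypothetical and are not the Hadoop API; the point is only that the NameNode keeps metadata while the DataNodes keep the bytes.

```python
class DataNode:
    """Worker: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: holds metadata only (which blocks make up a file, and where they live)."""

    def __init__(self):
        self.files = {}  # path -> list of (block_id, datanode name)

    def create(self, path, block_locations):
        self.files[path] = list(block_locations)

    def open(self, path):
        return self.files[path]

    def rename(self, old_path, new_path):
        self.files[new_path] = self.files.pop(old_path)


datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
namenode = NameNode()

# Write: the client sends each block to a DataNode, then records the layout.
locations = []
for i, chunk in enumerate([b"hello ", b"hadoop"]):
    target = "dn1" if i % 2 == 0 else "dn2"
    block_id = f"blk_{i}"
    datanodes[target].store(block_id, chunk)
    locations.append((block_id, target))
namenode.create("/demo.txt", locations)

# Rename touches metadata only; no block moves between DataNodes.
namenode.rename("/demo.txt", "/renamed.txt")

# Read: ask the NameNode where the blocks are, then fetch them from the DataNodes.
content = b"".join(
    datanodes[node].retrieve(block_id)
    for block_id, node in namenode.open("/renamed.txt")
)
print(content)
```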
Reading a File
Writing a File
NameNode Resilience
Backup metadata: the NameNode writes its metadata to the local disk and to NFS
Secondary NameNode: maintains a merged copy of the edit log
HDFS Federation: each NameNode manages a specific namespace
HDFS High Availability: a hot standby NameNode uses a shared edit log
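The idea of merging an edit log into a saved copy of the namespace can be sketched as follows. This is a deliberately simplified model: the namespace is a plain set of paths and the edit tuples are a made-up format, not the on-disk formats HDFS actually uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a set of file paths (in place)."""
    op, *args = edit
    if op == "create":
        namespace.add(args[0])
    elif op == "delete":
        namespace.discard(args[0])
    elif op == "rename":
        namespace.discard(args[0])
        namespace.add(args[1])
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Replay the edit log on top of the last image; return the new image and an empty log."""
    merged = set(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/a"}
edit_log = [("create", "/b"), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(image, edit_log)
print(sorted(new_image), new_log)
```

Keeping the log short in this way is what makes recovery faster: after a restart, only the edits made since the last checkpoint need to be replayed.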
Using HDFS
UI (Ambari)
Command-Line Interface
HTTP / HDFS Proxies
Java Interface
NFS Gateway
MapReduce
MapReduce is a programming model and implementation developed at
Google for processing and generating large datasets across a cluster of
computers.
MapReduce is a core component of Apache Hadoop, which distributes
processing on a cluster of computers.
MapReduce Programming Model
This programming model is inspired by (but not equivalent to) the map and reduce primitives
of functional programming languages such as Lisp.
map: takes as input a procedure and a sequence of values and applies
the procedure to each value in the sequence.
reduce: takes as input a sequence of values and combines all values
using a binary operator.
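For comparison, the two functional primitives look like this in Python, using the built-in map and functools.reduce. The sample values are arbitrary.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]

# map: apply a procedure to each value in the sequence
squares = list(map(lambda x: x * x, values))

# reduce: combine all values using a binary operator
total = reduce(add, squares)

print(squares, total)
```

Unlike these primitives, the MapReduce versions operate on key-value pairs and run distributed across a cluster, which is why the model is inspired by them rather than equivalent to them.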
How Does MapReduce Work?
MapReduce works by breaking the processing into two phases: the map
phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two
functions: the map function and the reduce function.
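The two phases can be simulated on a single machine to see how the key-value pairs flow. The helper run_mapreduce below is a hypothetical, in-memory stand-in for the framework, and word counting is used only as a familiar example; between the phases, the framework groups the intermediate values by key.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a map phase, group by key, then run a reduce phase (single process)."""
    # Map phase: every input (key, value) pair may emit any number of pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Grouping between the phases: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def word_mapper(_, line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    yield word, sum(counts)


lines = [(0, "Hadoop stores data"), (1, "Hadoop processes data")]
result = run_mapreduce(lines, word_mapper, count_reducer)
print(result)
```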
MapReduce Example
Challenge
What is the highest CGPA ever recorded at Makerere for each year?
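One way to frame the challenge as a map step and a reduce step is sketched below. The record layout (year, student ID, CGPA) and every value in it are made up for illustration; the slides do not show the actual dataset.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: "year,student_id,cgpa" (invented values)
records = [
    "2016,S001,4.21",
    "2016,S002,4.65",
    "2017,S003,4.40",
    "2017,S004,3.98",
    "2018,S005,4.80",
]


def map_cgpa(line):
    """Map: extract (year, cgpa) from one input record."""
    year, _student, cgpa = line.split(",")
    return year, float(cgpa)


def reduce_max(year, cgpas):
    """Reduce: keep the highest CGPA seen for a year."""
    return year, max(cgpas)


# Sorting by key stands in for the grouping the framework does between phases.
mapped = sorted(map(map_cgpa, records), key=itemgetter(0))
highest = [
    reduce_max(year, [cgpa for _, cgpa in pairs])
    for year, pairs in groupby(mapped, key=itemgetter(0))
]
print(highest)
```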
MapReduce Example
Figure: MapReduce logical data flow
Recent Developments
TonY (TensorFlow on YARN)
Hadoop Encryption
HDFS High Availability Enhancement
Ozone
Strengths and Weaknesses
Strengths
Varied Data sources
Cost effective
Performance
Fault tolerant
High availability
Low network traffic
Scalable
Strengths and Weaknesses
Weaknesses
Issue with small files
Processing overhead
Supports only batch processing
Inefficient for iterative processing
Where is Hadoop used?
LinkedIn Assessment
Question Calibration
Pilot Implementation
UI (Ambari)
Installing the dataset into HDFS
Using Ambari
Installing the dataset into HDFS
Using Command Line Interface
MapReduce
Writing the Mapper
def mapper_get_ratings(self, _, line):
    # Each input line is tab-separated: userID, movieID, rating, timestamp
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield rating, 1
MapReduce
Writing the Reducer
def reducer_count_ratings(self, key, values):
    # Total the 1s emitted by the mapper for each rating value
    yield key, sum(values)
MapReduce
Putting it all Together
from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreakdown(MRJob):
    def steps(self):
        # A single step: one mapper followed by one reducer
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Input lines are tab-separated: userID, movieID, rating, timestamp
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Total the 1s emitted for each rating value
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreakdown.run()
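Before submitting the job to a cluster, the mapper and reducer logic can be checked in plain Python without mrjob or Hadoop. The functions below mirror the methods above without the self argument, and the input lines are invented sample data in the same tab-separated layout.

```python
from collections import defaultdict


def mapper_get_ratings(line):
    """Same logic as the job's mapper: emit (rating, 1) for each input line."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def reducer_count_ratings(key, values):
    """Same logic as the job's reducer: total the counts for one rating."""
    yield key, sum(values)


# Invented sample lines: userID, movieID, rating, timestamp
sample = [
    "1\t10\t5\t1000",
    "2\t10\t3\t1001",
    "3\t11\t5\t1002",
    "4\t12\t1\t1003",
]

# Group the mapper output by rating, as the framework would between phases.
grouped = defaultdict(list)
for line in sample:
    for rating, one in mapper_get_ratings(line):
        grouped[rating].append(one)

breakdown = dict(
    pair
    for rating in sorted(grouped)
    for pair in reducer_count_ratings(rating, grouped[rating])
)
print(breakdown)
```

The rating keys remain strings here because the mapper yields the field exactly as it was split from the line.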
MapReduce
Running in Hadoop
Questions?
