Apache Hadoop - Kumaresan Manickavelu
Problems With Scale
  • Failure is the defining difference between distributed and local programming.
  • If components fail, their workload must be picked up by still-functioning units.
  • Nodes that fail and restart must be able to rejoin the group activity without a full group restart.
  • Increased load should cause a graceful decline in performance, not a failure of the whole system.
  • Increasing resources should support a proportional increase in load capacity.
  • Data must be stored and shared with the processing units.
Hadoop Ecosystem
Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
  • Hadoop Common: The common utilities that support the other Hadoop subprojects.
  • HDFS: A distributed file system that provides high-throughput access to application data.
  • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
HDFS
  • Based on Google’s GFS.
  • Redundant storage of massive amounts of data on cheap and unreliable computers.
  • Optimized for huge files that are mostly appended and read.
Architecture
  • HDFS has a master/slave architecture.
  • An HDFS cluster consists of a single NameNode and a number of DataNodes.
  • HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.
  • The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
  • The DataNodes are responsible for serving read and write requests from the file system’s clients.
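The split of responsibilities between the NameNode and the DataNodes can be illustrated with a toy, in-memory model. This is only a sketch and not the HDFS API: the class names, `BLOCK_SIZE`, `REPLICATION`, the round-robin placement, and the helper functions `write_file` and `read_file` are all assumptions made for illustration.

```python
from itertools import cycle

BLOCK_SIZE = 4    # bytes per block; tiny on purpose (assumed value)
REPLICATION = 2   # copies kept of each block (assumed value)


class DataNode:
    """Stores block contents and serves reads and writes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanode_names):
        self.block_map = {}                   # path -> [(block id, [DataNode names])]
        self._names = cycle(datanode_names)   # hypothetical round-robin placement

    def allocate(self, path, num_blocks):
        plan = []
        for i in range(num_blocks):
            replicas = [next(self._names) for _ in range(REPLICATION)]
            plan.append((f"{path}#{i}", replicas))
        self.block_map[path] = plan
        return plan

    def locate(self, path):
        return self.block_map[path]


def write_file(namenode, datanodes, path, data):
    """Client side: ask the NameNode for a plan, then send blocks to DataNodes."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    plan = namenode.allocate(path, len(chunks))
    for (block_id, replicas), chunk in zip(plan, chunks):
        for name in replicas:
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, path, failed=()):
    """Client side: read each block from the first replica on a healthy node."""
    out = b""
    for block_id, replicas in namenode.locate(path):
        name = next(n for n in replicas if n not in failed)
        out += datanodes[name].blocks[block_id]
    return out


nodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3")}
namenode = NameNode(list(nodes))
write_file(namenode, nodes, "/demo.txt", b"hello hadoop")
recovered = read_file(namenode, nodes, "/demo.txt", failed={"dn1"})
```

Because every block is stored on two different nodes, the read still succeeds when one DataNode is treated as failed, which is the point of redundant storage on unreliable machines.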
Map Reduce
  • Provides a clean abstraction for programmers to write distributed applications.
  • Factors out many reliability concerns from application logic.
  • A batch data processing system.
  • Automatic parallelization and distribution.
  • Fault tolerance.
  • Status and monitoring tools.
Programming Model
The programmer has to implement an interface of two functions:
  – map (in_key, in_value) -> (out_key, intermediate_value) list
  – reduce (out_key, intermediate_value list) -> out_value list
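The two signatures can be made concrete with a minimal single-process driver. This is a sketch of the programming model only, not Hadoop's API: `run_mapreduce`, `count_map`, and `count_reduce` are hypothetical names, and the grouping step stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group values by key, then apply reduce_fn."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        # map (in_key, in_value) -> (out_key, intermediate_value) list
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # reduce (out_key, intermediate_value list) -> out_value list
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def count_map(line_no, line):
    """Emit (word, 1) for every word on the line."""
    return [(word, 1) for word in line.split()]


def count_reduce(word, counts):
    """Sum the counts collected for one word."""
    return [sum(counts)]


result = run_mapreduce([(1, "a b a"), (2, "b c")], count_map, count_reduce)
print(result)
```

The application code is limited to the two small functions; iteration, grouping, and (on a real cluster) distribution and retries belong to the framework.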
Map Reduce Flow
Mapper (indexing example)
Input is the line number and the actual line.
  • Input 1: (“100”, “I Love India”)
  • Output 1: (“I”, “100”), (“Love”, “100”), (“India”, “100”)
  • Input 2: (“101”, “I Love eBay”)
  • Output 2: (“I”, “101”), (“Love”, “101”), (“eBay”, “101”)
Reducer (indexing example)
Input is a word and the list of line numbers on which it appears.
  • Input 1: (“I”, [“100”, “101”])
  • Input 2: (“Love”, [“100”, “101”])
  • Input 3: (“India”, [“100”])
  • Input 4: (“eBay”, [“101”])
Output: each word is stored along with its line numbers.
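The indexing example can be run end to end in a few lines. This is a sketch using the slide's two input lines; the function names `index_map` and `index_reduce` and the in-memory grouping are assumptions, standing in for what Hadoop does between the map and reduce phases.

```python
from collections import defaultdict


def index_map(line_no, line):
    """Emit (word, line_no) for every word on the line."""
    return [(word, line_no) for word in line.split()]


def index_reduce(word, line_nos):
    """Collapse the line numbers for one word into a sorted, de-duplicated list."""
    return sorted(set(line_nos))


lines = [("100", "I Love India"), ("101", "I Love eBay")]

# Group mapper output by word; on a cluster the framework does this step.
grouped = defaultdict(list)
for line_no, line in lines:
    for word, value in index_map(line_no, line):
        grouped[word].append(value)

index = {word: index_reduce(word, nos) for word, nos in grouped.items()}
print(index)
```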
Google PageRank example
Mapper
  • Input is a link and the HTML content of the page.
  • Output is a list of pairs: each outgoing link, together with the pagerank of this page.
Reducer
  • Input is a link and a list of pageranks of the pages linking to this page.
  • Output is the pagerank of this page, which is a weighted sum of all input pageranks.
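One PageRank iteration in this map/reduce shape can be sketched as follows. Several things here are assumptions not stated on the slide: the graph is given as link lists rather than parsed from HTML, each page splits its rank evenly among its outgoing links, the reducer uses the usual damped sum with `DAMPING = 0.85`, and every page has at least one outgoing link.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; not given on the slide


def pagerank_map(page, state):
    """Emit (target, share) pairs: the page's rank split evenly over its links."""
    rank, out_links = state
    return [(target, rank / len(out_links)) for target in out_links]


def pagerank_reduce(page, contributions, num_pages):
    """Combine the incoming shares into the page's new rank."""
    return (1 - DAMPING) / num_pages + DAMPING * sum(contributions)


def pagerank_step(graph, ranks):
    """Run one map phase, group by target page, and run one reduce phase."""
    incoming = defaultdict(list)
    for page, out_links in graph.items():
        for target, share in pagerank_map(page, (ranks[page], out_links)):
            incoming[target].append(share)
    return {page: pagerank_reduce(page, incoming[page], len(graph)) for page in graph}


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1 / len(graph) for page in graph}
for _ in range(20):
    ranks = pagerank_step(graph, ranks)
print(ranks)
```

Each iteration is one MapReduce job whose output feeds the next, which is why PageRank fits a batch processing system.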
Hadoop at Yahoo
  • Runs the world's largest Hadoop production application: the Yahoo! Search Webmap, a Hadoop application that runs on a Linux cluster of more than 10,000 cores.
  • Biggest contributor to Hadoop.
  • Converting all of its batch jobs to Hadoop.
Hadoop at Amazon
  • Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
  • Amazon Elastic MapReduce is a new web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.
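The New York Times figures imply a per-instance hourly rate, which a short calculation makes explicit. This sketch assumes that all 100 instances ran for the full 24 hours and that the $240 covers instance time only; neither is stated on the slide.

```python
instances = 100
hours = 24
total_cost_usd = 240
pdfs = 11_000_000

instance_hours = instances * hours                # total machine time used
implied_rate = total_cost_usd / instance_hours    # USD per instance-hour
pdfs_per_instance_hour = pdfs / instance_hours    # average throughput per machine

print(instance_hours, implied_rate, round(pdfs_per_instance_hour))
```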
Thanks Questions? kumaresan . manickavelu @ gmail.com

Editor's Notes

  • #3: Suppose each node fails about once a year. Then in a cluster of 365 nodes, about one node will fail every day. eBay Pools example. Example of thread and spring. Example of thumbs pool cache.
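The arithmetic behind this note can be written out. The sketch assumes that each node fails on average once a year, independently of the others; that rate is read into the note, not stated in it, and `expected_failures_per_day` is a hypothetical helper.

```python
def expected_failures_per_day(num_nodes, failures_per_node_per_year=1.0):
    """Average number of node failures per day across the whole cluster."""
    return num_nodes * failures_per_node_per_year / 365


small_cluster = expected_failures_per_day(365)
large_cluster = expected_failures_per_day(10_000)
print(small_cluster, round(large_cluster, 1))
```

At the same per-node rate, a cluster of 10,000 nodes would see dozens of failures a day, so failure handling has to be routine and not an exception.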