Map Reduce

Why do we need a grid?
  • Storage: a single disk cannot host all the data
  • Computation: a single CPU cannot provide all the computing needs
  • Parallel jobs: serial execution is no longer a viable option

What we expect from a framework
  • Distributed storage
  • Job specification platform
  • Job splitting/merging
  • Job execution and monitoring

Basic attributes expected
  • Resource management
      • Disk
      • CPU
      • Memory
      • Network bandwidth
  • Fault tolerance
      • Network failure
      • Machine failure
      • Job/code bug
      • …
  • Scalability

Hadoop Core
  • A separate distributed file system, HDFS, based on a Google File System style architecture
  • A separate job splitting and merging mechanism: the MapReduce framework, layered on top of the distributed file system
  • A custom job specification mechanism: input format, mapper, partitioner, combiner, reducer, output format
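
The job specification stages listed on the slide above can be sketched as a word count in plain Python. This is a minimal single-process illustration, not Hadoop's API: every function name here (`input_format`, `mapper`, `combiner`, `partitioner`, `reducer`, `output_format`, `run_job`) is a hypothetical stand-in for the corresponding stage, and the byte-sum partitioner is an assumption chosen only so that runs are repeatable.

```python
from collections import defaultdict
from itertools import groupby


def input_format(text):
    """Turn one input split into (offset, line) records."""
    offset = 0
    for line in text.splitlines():
        yield offset, line
        offset += len(line) + 1


def mapper(_offset, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate one map task's output before it is shuffled."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return local.items()


def partitioner(key, num_reducers):
    """Pick the reducer for a key (byte sum keeps the choice deterministic)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)


def output_format(results):
    """Render results as tab-separated lines."""
    return "\n".join(f"{word}\t{count}" for word, count in results)


def run_job(splits, num_reducers=2):
    # Map phase: one map task per split, each followed by its combiner.
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:
        mapped = [kv for off, line in input_format(split)
                  for kv in mapper(off, line)]
        for word, count in combiner(mapped):
            buckets[partitioner(word, num_reducers)].append((word, count))
    # Reduce phase: sort each partition by key, group, then reduce.
    results = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
        for word, group in groupby(bucket, key=lambda kv: kv[0]):
            results.append(reducer(word, (c for _, c in group)))
    return sorted(results)


splits = ["the quick brown fox", "the lazy dog and the fox"]
counts = dict(run_job(splits))
```

Each split is mapped and combined independently, which is what lets a real framework run the map tasks on different machines; only the combined pairs cross the partition boundary.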

HDFS attributes
  • Distributed, reliable, scalable
  • Optimized for streaming reads of very large data sets
  • Assumes write once, read many times
  • No local caching possible because of large files and streaming reads
  • High data replication
  • Fits logically with MapReduce
  • Synchronized access to metadata through the namenode
  • Metadata (edit log, FsImage) is stored in the namenode's local OS file system
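
The last point on the slide above, metadata kept as an image plus an edit log, can be illustrated with a small sketch. It assumes a toy namespace held in a dictionary and three made-up edit operations (`create`, `add_block`, `delete`); the real namenode's on-disk formats and operations are different, so treat this only as a picture of the idea that the current namespace is the last image with the logged edits replayed on top.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path, value = edit
    if op == "create":
        namespace[path] = {"replication": value, "blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(value)
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown edit: {op}")


def recover(fsimage, edit_log):
    """Rebuild the namespace: copy the image, then replay the edit log."""
    namespace = {
        path: {"replication": meta["replication"],
               "blocks": list(meta["blocks"])}
        for path, meta in fsimage.items()
    }
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edit_log):
    """Fold the log into a new image and start with an empty log."""
    return recover(fsimage, edit_log), []


# Hypothetical paths and block ids, for illustration only.
fsimage = {"/logs/a.txt": {"replication": 3, "blocks": ["blk_1"]}}
edit_log = [
    ("create", "/logs/b.txt", 3),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("delete", "/logs/a.txt", None),
]
namespace = recover(fsimage, edit_log)
new_image, new_log = checkpoint(fsimage, edit_log)
```

Appending to a log is cheap, while rewriting the whole image is not, which is one reason to keep the two separate and merge them only at checkpoints.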

HDFS (figure copied from the HDFS design document)

MapReduce framework attributes
  • Fair isolation: easy synchronization and failover
  • ...

MapReduce (figure copied from the Yahoo tutorial)

(Figure copied from the Yahoo tutorial)

Fault tolerance goal
  • Hadoop assumes that at least one machine is down at any given time
  • HDFS
      • Block-level replication
      • Replicated and persistent metadata
      • Rack awareness, including consideration of whole-rack failure
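
To make the rack-awareness point above concrete, here is a small sketch of replica placement. The policy is an assumption made for illustration (first replica on the writer's node, second on a different rack, the rest preferring the second replica's rack); it is deterministic for readability and is not Hadoop's actual placement code. The names `place_replicas` and `surviving_replicas` and the topology are hypothetical.

```python
def place_replicas(topology, writer, replication=3):
    """Choose nodes for one block's replicas.

    topology maps rack name -> list of node names; writer is the node
    that is writing the block.
    """
    rack_of = {node: rack
               for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]
    # Second replica goes to another rack so a rack failure cannot take
    # out every copy.
    remote = sorted(n for n in rack_of if rack_of[n] != rack_of[writer])
    if replication > 1 and remote:
        chosen.append(remote[0])
    # Remaining replicas prefer the rack of the most recently chosen node.
    anchor_rack = rack_of[chosen[-1]]
    rest = sorted((n for n in rack_of if n not in chosen),
                  key=lambda n: (rack_of[n] != anchor_rack, n))
    chosen.extend(rest[:max(0, replication - len(chosen))])
    return chosen[:replication]


def surviving_replicas(replicas, topology, failed_rack):
    """Replicas still reachable after a whole rack goes down."""
    dead = set(topology[failed_rack])
    return [n for n in replicas if n not in dead]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(topology, writer="n1")
after_rack1_fails = surviving_replicas(replicas, topology, "rack1")
after_rack2_fails = surviving_replicas(replicas, topology, "rack2")
```

Whichever single rack fails in this example, at least one replica of the block survives, which is the property the slide asks for.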

Fault tolerance goal, contd.
  • MapReduce
      • No dependency is assumed between tasks
      • Tasks from a failed node can be transferred to other nodes without any state information
      • Mapper: all of the node's tasks are re-executed on other nodes
      • Reducer: only unexecuted tasks are transferred, since all completed results have already been written to the output
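
One reading of the mapper and reducer rules above, as a sketch: when a node fails, every map task that ran on it is rescheduled (even finished ones, on the assumption that their intermediate output lived on that node), while only unfinished reduce tasks are rescheduled. The task records, field names, and round-robin reassignment below are all hypothetical.

```python
def tasks_to_rerun(tasks, failed_node):
    """Return ids of tasks that must run again after failed_node dies."""
    rerun = []
    for task in tasks:
        if task["node"] != failed_node:
            continue
        if task["kind"] == "map":
            # Map tasks are re-executed whether or not they had finished.
            rerun.append(task["id"])
        elif task["state"] != "done":
            # Finished reduce tasks already wrote their results to the output.
            rerun.append(task["id"])
    return rerun


def reassign(tasks, failed_node, live_nodes):
    """Move the affected tasks to live nodes, round-robin, as pending."""
    rerun = set(tasks_to_rerun(tasks, failed_node))
    moved = 0
    for task in tasks:
        if task["id"] in rerun:
            task["node"] = live_nodes[moved % len(live_nodes)]
            task["state"] = "pending"
            moved += 1
    return tasks


tasks = [
    {"id": "m1", "kind": "map", "node": "n1", "state": "done"},
    {"id": "m2", "kind": "map", "node": "n2", "state": "done"},
    {"id": "r1", "kind": "reduce", "node": "n1", "state": "done"},
    {"id": "r2", "kind": "reduce", "node": "n1", "state": "running"},
]
rerun_ids = tasks_to_rerun(tasks, "n1")
reassign(tasks, "n1", live_nodes=["n2", "n3"])
```

Because tasks carry no shared state, rescheduling needs nothing more than the task description itself.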

Resource management goal
  • CPU/memory
      • Mechanisms are provided for streaming directly to the file descriptor, so very large objects need no user-level operations
      • Optimized sorting: the order can mostly be decided from the serialized bytes, without instantiating objects around them
      • ...
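
The sorting point above can be shown with a short sketch: if keys are serialized so that byte order matches key order, records can be sorted while still serialized. The record layout here (a 4-byte big-endian unsigned integer key followed by a UTF-8 payload) is an assumption made for the example, and `serialize`, `deserialize`, and `raw_sort` are hypothetical helper names.

```python
import struct


def serialize(key, payload):
    """Pack a record as a 4-byte big-endian unsigned key plus payload."""
    return struct.pack(">I", key) + payload.encode("utf-8")


def deserialize(record):
    """Unpack a record back into (key, payload)."""
    (key,) = struct.unpack(">I", record[:4])
    return key, record[4:].decode("utf-8")


def raw_sort(records):
    """Sort by the raw key bytes; no record is deserialized here."""
    return sorted(records, key=lambda record: record[:4])


records = [serialize(k, p)
           for k, p in [(300, "c"), (2, "a"), (70000, "d"), (255, "b")]]
ordered = [deserialize(r) for r in raw_sort(records)]
```

This works because big-endian unsigned integers compare the same way byte by byte as they do numerically. Signed or variable-length keys would need an encoding chosen to preserve that property, which is why the slide says the order can "mostly" be decided from the bytes.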

Resource management goal, contd.
  • Bandwidth
      • The HDFS architecture ensures that a read request is served from the nearest node (replication)
      • The MapReduce framework ensures that operations are executed as near to the data as possible: moving operations is cheaper than moving data
      • Optimized operations at every stage: combiner, data replication (parallel buffering and transfer from one node), ...
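
A sketch of the two locality rules above, under stated assumptions: nodes are identified by hypothetical `(rack, node)` pairs, and the distance weights 0, 2, and 4 (same node, same rack, different rack) are illustrative values, not a statement of what any Hadoop release uses.

```python
def distance(a, b):
    """Assumed network distance between two (rack, node) locations."""
    if a == b:
        return 0          # same node
    if a[0] == b[0]:
        return 2          # same rack, different node
    return 4              # different racks


def nearest_replica(reader, replicas):
    """Serve a read from the replica closest to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))


def schedule_map_task(replicas, free_nodes):
    """Run the task on the free node closest to any replica of its input."""
    return min(free_nodes,
               key=lambda n: min(distance(n, r) for r in replicas))


replicas = [("rack1", "n1"), ("rack2", "n3"), ("rack2", "n4")]
local_read = nearest_replica(("rack2", "n4"), replicas)
rack_read = nearest_replica(("rack1", "n2"), replicas)
task_node = schedule_map_task(
    replicas,
    free_nodes=[("rack3", "n9"), ("rack1", "n2"), ("rack2", "n3")],
)
```

In both functions the data stays where it is and the reader or the task is matched to it, which is the "moving operations is cheaper than moving data" idea in code form.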

Scalability goal
  • Flat scalability: adding or removing a node is fairly straightforward

Sub projects
  • ZooKeeper for small shared information (useful for synchronization, locks, leader election, and many other sharing problems in distributed systems)
  • HBase for semi-structured data (an implementation of the Google Bigtable design)
  • Hive for ad hoc query analysis (currently supports insertion into multiple tables, group by, and multiple-table selection; order by is under construction)
  • Avro for data serialization applicable to MapReduce

How about other frameworks?

Questions?
