Map Reduce

Why do we need a grid?
- Storage --> a single disk cannot host all the data
- Computation --> a single CPU cannot provide all the computing needs
- Parallel jobs --> serial execution is no longer a viable option

What we expect from a framework
- Distributed storage
- Job specification platform
- Job splitting/merging
- Job execution and monitoring

Basic attributes expected
- Resource management
  - Disk
  - CPU
  - Memory
  - Network bandwidth
- Fault tolerance
  - Network failure
  - Machine failure
  - Job/code bug
  - …
- Scalability

Hadoop Core
- Separate distributed file system based on a Google File System type architecture --> HDFS
- Separate job splitting and merging mechanism: the MapReduce framework on top of the distributed file system
- Provides a custom job specification mechanism --> input format, mapper, partitioner, combiner, reducer, output format
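The job specification stages listed above can be illustrated with a small in-memory word count. This is only a sketch of how the pieces fit together, not Hadoop's API: the function names (`input_format`, `mapper`, `combiner`, `partitioner`, `reducer`, `output_format`, `run_job`) are hypothetical and simply mirror the stage names, and each string in `splits` stands in for one input split.

```python
from collections import defaultdict

def input_format(text):
    """Turn one split into (byte offset, line) records."""
    offset = 0
    for line in text.splitlines():
        yield offset, line
        offset += len(line) + 1

def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def combiner(word, counts):
    """Pre-aggregate on the map side to reduce shuffle traffic."""
    yield word, sum(counts)

def partitioner(word, num_reducers):
    """Pick the reducer for a key (a simple deterministic hash)."""
    return sum(word.encode("utf-8")) % num_reducers

def reducer(word, counts):
    """Sum all partial counts for one word."""
    yield word, sum(counts)

def output_format(pairs):
    """Render results as tab-separated lines."""
    return "\n".join(f"{k}\t{v}" for k, v in sorted(pairs))

def run_job(splits, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split stands in for one map task
        local = defaultdict(list)
        for key, line in input_format(split):
            for word, one in mapper(key, line):
                local[word].append(one)
        for word, counts in local.items():
            for w, c in combiner(word, counts):
                partitions[partitioner(w, num_reducers)][w].append(c)
    results = []
    for part in partitions:  # each partition stands in for one reduce task
        for word in sorted(part):
            results.extend(reducer(word, part[word]))
    return output_format(results)
```

Calling `run_job(["a b a", "b c"])` counts words across two splits. In a real cluster the map and reduce loops would run as separate tasks on different nodes.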

HDFS attributes
- Distributed, reliable, scalable
- Optimized for streaming reads and very large data sets
- Assumes write once, read many times
- No local caching possible due to large files and streaming reads
- High data replication
- Fits logically with MapReduce
- Synchronized access to metadata --> NameNode
- Metadata (edit log, FsImage) is stored in the NameNode's local OS file system

HDFS (figure copied from the HDFS design document)

MapReduce framework attributes
- Fair isolation --> easy synchronization and failover
- ...

MapReduce (figure copied from the Yahoo tutorial)

Fault tolerance goal
Hadoop assumes that at least one machine is down at any given time.
- HDFS
  - Block-level replication
  - Replicated and persistent metadata
  - Rack awareness and consideration of whole-rack failure
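To make the rack-awareness point concrete, here is a minimal sketch of a replica placement rule that keeps copies of a block on at least two racks, so that losing a whole rack still leaves a readable copy. The policy, the function names (`place_replicas`, `survives_rack_failure`), and the cluster layout are assumptions for illustration. They are a simplification, not the exact policy HDFS implements.

```python
def place_replicas(nodes_by_rack, writer_rack, replication=3):
    """Return (rack, node) pairs for one block.

    Simplified, hypothetical policy: first replica on the writer's rack,
    the rest on one other rack, falling back to the writer's rack if the
    other rack has too few nodes.
    """
    local = nodes_by_rack[writer_rack]
    remote_racks = sorted(r for r in nodes_by_rack if r != writer_rack)
    if not local or not remote_racks:
        raise ValueError("need the writer's rack plus at least one other rack")
    placement = [(writer_rack, local[0])]
    remote = remote_racks[0]
    for node in nodes_by_rack[remote]:
        if len(placement) == replication:
            break
        placement.append((remote, node))
    for node in local[1:]:
        if len(placement) == replication:
            break
        placement.append((writer_rack, node))
    return placement

def survives_rack_failure(placement, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack != failed_rack for rack, _ in placement)
```

With two racks of two nodes each and a writer on the first rack, the block ends up with one local copy and two remote copies, and `survives_rack_failure` holds for either rack.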

Fault tolerance goal, contd.
- MapReduce
  - No dependency is assumed between tasks
  - Tasks from a failed node can be transferred to other nodes without any state information
  - Mapper --> the whole task is re-executed on another node
  - Reducer --> only unexecuted tasks need to be transferred, since all completed results are already written to the output
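The re-execution rule above can be sketched as a small rescheduling function. The task record layout (`id`, `kind`, `node`, `done`) and the function name `reschedule` are hypothetical. The sketch also assumes that map tasks on a failed node are rerun even if they had finished, since their intermediate output would be lost with the node, while finished reduce tasks are skipped.

```python
from itertools import cycle

def reschedule(tasks, failed_node, live_nodes):
    """Return new task records for work that must move off failed_node."""
    if not live_nodes:
        raise ValueError("no live nodes to take over the tasks")
    targets = cycle(live_nodes)  # spread moved tasks round-robin
    rescheduled = []
    for task in tasks:
        if task["node"] != failed_node:
            continue  # unaffected by the failure
        if task["kind"] == "reduce" and task["done"]:
            continue  # result is already written to the job output
        # No state is carried over: the task simply starts again elsewhere.
        rescheduled.append(dict(task, node=next(targets), done=False))
    return rescheduled
```

Because tasks are independent and stateless, rescheduling is just a matter of assigning the same task description to a different node.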

Resource management goal
- CPU/Memory
  - Mechanisms are provided so that data can be streamed directly to the file descriptor --> no user-level operations on very large objects
  - Optimized sorting is possible: the order can mostly be decided from the serialized bytes without instantiating objects around them
  - ....
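The byte-level sorting idea can be shown with a small sketch. It assumes keys are serialized as fixed-width, big-endian unsigned integers, an encoding chosen here for illustration because its lexicographic byte order matches numeric order. The helper names `serialize` and `sort_serialized` are hypothetical.

```python
import struct

def serialize(n):
    """Encode a non-negative int as 4 big-endian bytes."""
    return struct.pack(">I", n)

def sort_serialized(records):
    """Sort serialized keys by comparing raw bytes only.

    No integer objects are rebuilt during the sort; Python compares
    bytes values lexicographically, which matches numeric order for
    this encoding.
    """
    return sorted(records)

raw = [serialize(n) for n in (70000, 3, 256, 65535)]
ordered = sort_serialized(raw)
# Deserialize only to check the result.
decoded = [struct.unpack(">I", b)[0] for b in ordered]
```

The same trick would not work for little-endian or variable-length encodings without a custom comparison, which is why the choice of serialization format matters for sort performance.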

Resource management goal, contd.
- Bandwidth
  - The HDFS architecture ensures that a read request is served from the nearest node (replication)
  - The MapReduce framework ensures that operations are executed nearest to the data --> moving operations is cheaper than moving data
  - Optimized operations in every stage --> combiner, data replication (parallel buffering and transfer from one node), ...
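A minimal sketch of "run the operation nearest to the data": given the nodes holding replicas of a block and the nodes with a free task slot, pick the free node with the smallest network distance to any replica. The distance weights (0 for the same node, 2 for the same rack, 4 otherwise) and the names `distance` and `pick_node` are illustrative assumptions, not the scheduler Hadoop ships.

```python
def distance(node_a, node_b, rack_of):
    """Illustrative network distance between two nodes."""
    if node_a == node_b:
        return 0  # data is local to the node
    if rack_of[node_a] == rack_of[node_b]:
        return 2  # same rack, one switch away
    return 4      # different racks

def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose the free node closest to any replica of the block.

    Ties are broken by node name so the choice is deterministic.
    """
    def cost(node):
        return min(distance(node, r, rack_of) for r in replica_nodes)
    return min(free_nodes, key=lambda n: (cost(n), n))
```

A node that already stores a replica wins outright. Failing that, a node on the same rack as a replica is preferred over one on another rack.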

Scalability goal
- Flat scalability --> adding or removing a node is fairly straightforward

Sub-projects
- ZooKeeper for small shared information (useful for synchronization, locks, leader election, and many other sharing problems in distributed systems)
- HBase for semi-structured data (provides an implementation of the Google Bigtable design)
- Hive for ad hoc query analysis (currently supports insertion into multiple tables, group by, and multiple-table selection; order by is under construction)
- Avro for data serialization applicable to MapReduce
