Apache Hadoop

Apache Hadoop - Kumaresan Manickavelu

Problems With Scale Failure is the defining difference between distributed and local programming If components fail, their workload must be picked up by still-functioning units Nodes that fail and restart must be able to rejoin the group activity without a full group restart Increased load should cause graceful decline Increasing resources should support a proportional increase in load capacity Storing and Sharing data with processing units.

Hadoop Echo System Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing. Hadoop Common: The common utilities that support the other Hadoop subprojects. HDFS: A distributed file system that provides high throughput access to application data. MapReduce: A software framework for distributed processing of large data sets on compute clusters. Pig: A high-level data-flow language and execution framework for parallel computation. HBase: A scalable, distributed database that supports structured data storage for large tables.

HDFS Based on Google’s GFS Redundant storage of massive amounts of data on cheap and unreliable computers Optimized for huge files that are mostly appended and read Architecture HDFS has a master/slave architecture An HDFS cluster consists of a single NameNode and a number of DataNodes HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients.

Map Reduce Provides a clean abstraction for programmers to write distributed application. Factors out many reliability concerns from application logic A batch data processing system Automatic parallelization & distribution Fault-tolerance Status and monitoring tools

Programming Model Programmer has to implement interface of two functions: – map (in_key, in_value) -> (out_key, intermediate_value) list – reduce (out_key, intermediate_value list) -> out_value list

Mapper (indexing example) Input is the line no and the actual line. Input 1 : (“100”,“I Love India ”) Output 1 : (“I”,“100”), (“Love”,“100”), (“India”,“100”) Input 2 : (“101”,“I Love eBay”) Output 2 : (“I”,“101”), (“Love”,“101”), (“eBay”,“101”)

Reducer (indexing example) Input is word and the line nos. Input 1 : (“I”,“100”,”101”) Input 2 : (“Love”,“100”,”101”) Input 3 : (“India”, “100”) Input 4 : (“eBay”, “101”) Output, the words are stored along with the line nos.

Google Page Rank example Mapper Input is a link and the html content Output is a list of outgoing link and pagerank of this page Reducer Input is a link and a list of pagranks of pages linking to this page Output is the pagerank of this page, which is the weighted average of all input pageranks

Hadoop at Yahoo World's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster Biggest contributor to Hadoop. Converting All its batches to Hadoop.

Hadoop at Amazon Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3) The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 Amazon Elastic MapReduce is a new web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.

Thanks Questions? kumaresan . manickavelu @ gmail.com

Apache Hadoop

More Related Content

What's hot (19)

Similar to Apache Hadoop (20)

Recently uploaded (20)

Apache Hadoop

Editor's Notes