CSE509: Introduction to Web Science and Technology
Lecture 4: Dealing with Large-Scale Web Data: Large-Scale File Systems and MapReduce
Muhammad Atif Qureshi
Web Science Research Group, Institute of Business Administration (IBA)
July 30, 2011
Last Time…
- Search Engine Architecture
- Overview of Web Crawling
- Web Link Structure
- Ranking Problem
- SEO and Web Spam
- Web Spam Research
Today
- Web Data Explosion
- Part I
  - MapReduce Basics
  - MapReduce Example and Details
  - MapReduce Case Study: Web Crawler Based on a MapReduce Architecture
- Part II
  - Large-Scale File Systems
  - Google File System Case Study
Introduction
- Web data sets can be very large: tens to hundreds of terabytes
- Cannot mine on a single server (why?)
- "Big data" is a fact on the World Wide Web
- Larger data implies effective algorithms
- Web-scale processing: data-intensive processing
- Also applies to startups and niche players
How Much Data?
- Google processes 20 PB a day (2008)
- Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
- eBay has 6.5 PB of user data + 50 TB/day (5/2009)
- CERN's LHC will generate 15 PB a year (??)
Cluster Architecture
[Figure: racks of commodity nodes, each node with its own CPU, memory, and disk; each rack contains 16-64 nodes connected through a rack switch at 1 Gbps between any pair of nodes; a 2-10 Gbps backbone links the rack switches]
Concerns
- If we had to abort and restart the computation every time one component fails, the computation might never complete successfully
- If one node fails, all its files would be unavailable until the node is replaced; this can also lead to permanent loss of files
Solutions: MapReduce and the Google File System
PART I: MapReduce
Major Ideas
- Scale "out", not "up" (distributed clusters vs. SMP)
  - Limits of SMP and large shared-memory machines
- Move processing to the data
  - Clusters have limited bandwidth
- Process data sequentially, avoid random access
  - Seeks are expensive; disk throughput is reasonable
- Seamless scalability
  - From the traditional mythical man-month approach to the newer notion of the tradable machine-hour
  - Twenty-one chickens together cannot make an egg hatch in a day
Traditional Parallelization: Divide and Conquer
[Figure: the "Work" is partitioned into units w1, w2, w3; each unit is handled by a "worker" producing partial results r1, r2, r3; the partial results are combined into the final "Result"]
Parallelization Challenges
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
Common Theme
- Parallelization problems arise from:
  - Communication between workers (e.g., to exchange state)
  - Access to shared resources (e.g., data)
- Thus, we need a synchronization mechanism
Parallelization is Hard
- Traditionally, concurrency is difficult to reason about (from uniprocessors up to small-scale architectures)
- Concurrency is even more difficult to reason about
  - At the scale of datacenters (even across datacenters)
  - In the presence of failures
  - In terms of multiple interacting services
- Not to mention debugging…
- The reality:
  - Write your own dedicated library, then program with it
  - Burden on the programmer to explicitly manage everything
Solution: MapReduce
- Programming model for expressing distributed computations at a massive scale
- Hides system-level details from the developers
  - No more race conditions, lock contention, etc.
- Separates the what from the how
  - Developer specifies the computation that needs to be performed
  - Execution framework ("runtime") handles actual execution
What is MapReduce Used For?
- At Google:
  - Index building for Google Search
  - Article clustering for Google News
  - Statistical machine translation
- At Yahoo!:
  - Index building for Yahoo! Search
  - Spam detection for Yahoo! Mail
- At Facebook:
  - Data mining
  - Ad optimization
  - Spam detection
Typical MapReduce Execution
- Iterate over a large number of records
- Extract something of interest from each        (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results                 (Reduce)
- Generate final output
Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)
MapReduce Basics
- Programmers specify two functions:
  - map (k, v) -> <k', v'>*
  - reduce (k', v') -> <k', v'>*
  - All values with the same key are sent to the same reducer
- The execution framework handles everything else…
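To make that contract concrete, here is a minimal single-machine sketch in Python. It is illustrative only: run_mapreduce and its signature are our own invention, not part of any MapReduce API. It shows the three things the framework does for you: apply map to every input record, group intermediate values by key (the shuffle), and hand each key with its list of values to reduce.

from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: apply the user's mapper to every (key, value) input record.
    intermediate = []
    for k, v in records:
        intermediate.extend(mapper(k, v))

    # Shuffle phase: group all intermediate values by key; this grouping is
    # what guarantees that all values with the same key reach the same reducer.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: hand each key and its list of values to the user's reducer.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output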
Warm-Up Example: Word Count
- We have a large file of words, one word per line
- Count the number of times each distinct word appears in the file
- Sample application: analyze web server logs to find popular URLs
Word Count (2)
- Case 1: Entire file fits in memory
- Case 2: File too large for memory, but all <word, count> pairs fit in memory
- Case 3: File on disk, too many distinct words to fit in memory
  sort datafile | uniq -c
Word Count (3)
- To make it slightly harder, suppose we have a large corpus of documents
- Count the number of times each distinct word occurs in the corpus
  words(docs/*) | sort | uniq -c
  where words takes a file and outputs the words in it, one per line
- The above captures the essence of MapReduce
- The great thing is that it is naturally parallelizable
Word Count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
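For a runnable counterpart to the pseudocode above, the sketch below plugs word count into the hypothetical run_mapreduce simulator from the earlier sketch; wc_map, wc_reduce, and the tiny two-document corpus are our own illustrative choices, and the tokenization is deliberately simplistic.

import re

def wc_map(doc_name, text):
    # Emit (word, 1) for every word in the document body.
    for word in re.findall(r"\w+", text.lower()):
        yield (word, 1)

def wc_reduce(word, counts):
    # Sum the partial counts collected for one word.
    yield (word, sum(counts))

docs = [("d1", "see bob run"), ("d2", "see spot throw")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# -> [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]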
Word Count Illustration

map(key=url, val=contents):
  for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
  sum all "1"s in the values list
  emit result "(word, sum)"

Input documents: "see bob run" and "see spot throw"
Map output: (see, 1), (bob, 1), (run, 1), (see, 1), (spot, 1), (throw, 1)
Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
Implementation Overview
- 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory each
- Limited bandwidth
- Storage is on local IDE disks
- GFS: distributed file system manages the data (SOSP '03)
- Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
- The implementation at Google is a C++ library linked into user programs
Distributed Execution Overview
[Figure, adapted from (Dean and Ghemawat, OSDI 2004): (1) the user program submits the job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read their input splits (split 0 … split 4); (4) map output is written as intermediate files on local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the final output files (output file 0, output file 1)]
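The step the figure glosses over is how intermediate pairs find their reduce task. The toy sketch below illustrates the usual approach, hash partitioning (hash(key) mod R); the number of reduce tasks R and the sample map output are our own example values.

import zlib

R = 4  # number of reduce tasks -- our choice for this example

def partition(key, num_reducers=R):
    # crc32 gives a hash that is deterministic across runs, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % num_reducers

map_output = [("see", 1), ("bob", 1), ("run", 1),
              ("see", 1), ("spot", 1), ("throw", 1)]

# Each map task writes its intermediate pairs into R buckets, one per reduce task.
buckets = {r: [] for r in range(R)}
for k, v in map_output:
    buckets[partition(k)].append((k, v))

# Each reduce task r later fetches bucket r from every map task's local disk,
# sorts it by key, and applies the user's reduce function to each key group.
for r, pairs in sorted(buckets.items()):
    print(r, sorted(pairs))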
MapReduce Implementations
- Google has a proprietary implementation in C++
  - Bindings in Java, Python
- Hadoop is an open-source implementation in Java
  - Development led by Yahoo!, used in production
  - Now an Apache project
  - Rapidly expanding software ecosystem
- Lots of custom research implementations
  - For GPUs, Cell processors, etc.
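Hadoop's streaming mode runs mapper and reducer executables that read lines on stdin and write key<TAB>value lines on stdout, so the word-count job can also be expressed as two small scripts. The sketch below is one such pair; the file names mapper.py and reducer.py are our own, and the scripts assume the reducer sees its input sorted by key (which the framework, or a local sort, provides).

#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sum the counts for each word; input arrives sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The same pair can be tested locally, without a cluster, with a shell pipeline such as: cat docs.txt | python3 mapper.py | sort | python3 reducer.py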
Bonus Assignment: Write a MapReduce version of Assignment no. 2
MapReduce in VisionerBOT
VisionerBOT Distributed Design
PART II: Google File System
Distributed File System
- Don't move data to workers… move workers to the data!
  - Store data on the local disks of nodes in the cluster
  - Start up the workers on the node that has the data local
- Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow, but disk throughput is reasonable
- A distributed file system is the answer
  - GFS (Google File System) for Google's MapReduce
  - HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
- Commodity hardware over "exotic" hardware
  - Scale "out", not "up"
- High component failure rates
  - Inexpensive commodity components fail all the time
- "Modest" number of huge files
  - Multi-gigabyte files are common, if not encouraged
- Files are write-once, mostly appended to
  - Perhaps concurrently
- Large streaming reads over random access
  - High sustained throughput over low latency
(GFS slides adapted from material by Ghemawat et al., SOSP 2003)
GFS: Design Decisions
- Files stored as chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access and keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large datasets, streaming reads
- Simplify the API
  - Push some of the issues onto the client (e.g., data layout)
- HDFS = GFS clone (same basic ideas)
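To make the metadata-only role of the master tangible, here is a toy sketch of the design above. It is our own simplification, not GFS's actual interfaces or data structures: a master that tracks file-to-chunk and chunk-to-replica mappings with 64 MB chunks and 3-way replication, while the file data itself would live only on the chunkservers.

import itertools
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the GFS paper
REPLICATION = 3                 # each chunk lives on 3+ chunkservers

class ToyMaster:
    """Toy GFS-like master: keeps only metadata, never file data."""

    def __init__(self, chunkservers):
        self.chunkservers = list(chunkservers)
        self.files = {}            # file name -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> chunkservers holding replicas
        self._handles = itertools.count()

    def create_file(self, name, size_bytes):
        n_chunks = max(1, (size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE)
        handles = []
        for _ in range(n_chunks):
            handle = next(self._handles)
            # Place each replica on a distinct chunkserver.
            replicas = random.sample(self.chunkservers, REPLICATION)
            self.chunk_locations[handle] = replicas
            handles.append(handle)
        self.files[name] = handles

    def lookup(self, name, offset):
        """Client asks: which chunk holds this byte offset, and on which servers?"""
        handle = self.files[name][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

master = ToyMaster(chunkservers=[f"cs{i}" for i in range(10)])
master.create_file("crawl/part-0000", size_bytes=5 * 10**9)   # ~5 GB file
print(master.lookup("crawl/part-0000", offset=200 * 1024 * 1024))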
QUESTIONS?

Editor's Notes

#7: In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have "processing nodes" and "storage nodes" linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data are processed sequentially. A simple scenario poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1-terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access. The development of solid-state drives is unlikely to change this balance, for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.

MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce's design explicitly trade latency for throughput.
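The month-versus-workday comparison in the note is easy to reproduce with a back-of-the-envelope script. The 10 ms seek time and 100 MB/s sustained transfer rate below are our own assumptions (roughly in line with commodity disks of that era), not figures taken from the note.

# Back-of-the-envelope: update 1% of 10^10 100-byte records (a 1 TB database).
SEEK_S = 0.010          # assumed: ~10 ms per random seek
THROUGHPUT = 100e6      # assumed: ~100 MB/s sustained sequential transfer
RECORDS = 10**10
RECORD_BYTES = 100
UPDATED = RECORDS // 100

# Random access: one seek to read and one seek to write back each updated record.
random_update_s = UPDATED * 2 * SEEK_S

# Sequential: stream the whole 1 TB in, rewrite the whole 1 TB out.
db_bytes = RECORDS * RECORD_BYTES
sequential_rewrite_s = 2 * db_bytes / THROUGHPUT

print(f"random updates : {random_update_s / 86400:6.1f} days")    # ~23 days
print(f"full rewrite   : {sequential_rewrite_s / 3600:6.1f} hours")  # ~5.6 hours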