CSE509: Introduction to Web Science and Technology
Lecture 4: Dealing with Large-Scale Web Data: Large-Scale File Systems and MapReduce
Muhammad Atif Qureshi
Web Science Research Group, Institute of Business Administration (IBA)
July 30, 2011
Last Time…
- Search Engine Architecture
- Overview of Web Crawling
- Web Link Structure
- Ranking Problem
- SEO and Web Spam
- Web Spam Research
Today
- Web Data Explosion
- Part I
  - MapReduce Basics
  - MapReduce Example and Details
  - MapReduce Case Study: Web Crawler Based on a MapReduce Architecture
- Part II
  - Large-Scale File Systems
  - Google File System Case Study
Introduction
- Web data sets can be very large: tens to hundreds of terabytes
- Cannot mine on a single server (why?)
- "Big data" is a fact on the World Wide Web
- Larger data implies effective algorithms
- Web-scale processing: data-intensive processing
- Also applies to startups and niche players
How Much Data?
- Google processes 20 PB a day (2008)
- Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
- eBay has 6.5 PB of user data + 50 TB/day (5/2009)
- CERN's LHC will generate 15 PB a year (??)
Cluster Architecture
[Figure: racks of commodity nodes, each node with its own CPU, memory, and disk; each rack contains 16-64 nodes connected through a rack switch at 1 Gbps between any pair of nodes; a 2-10 Gbps backbone links the rack switches]
Concerns
- If we had to abort and restart the computation every time one component fails, the computation might never complete successfully
- If one node fails, all its files would be unavailable until the node is replaced; this can also lead to permanent loss of files
Solutions: MapReduce and the Google File System
PART I: MapReduce
Major Ideas
- Scale "out", not "up" (distributed clusters vs. SMP)
  - Limits of SMP and large shared-memory machines
- Move processing to the data
  - Clusters have limited bandwidth
- Process data sequentially, avoid random access
  - Seeks are expensive; disk throughput is reasonable
- Seamless scalability
  - From the traditional mythical man-month approach to the newer notion of the tradable machine-hour
  - Twenty-one chickens together cannot make an egg hatch in a day
Traditional Parallelization: Divide and Conquer
[Figure: the "Work" is partitioned into units w1, w2, w3; each unit is handled by a "worker" producing partial results r1, r2, r3; the partial results are combined into the final "Result"]
Parallelization Challenges
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
Common Theme
- Parallelization problems arise from:
  - Communication between workers (e.g., to exchange state)
  - Access to shared resources (e.g., data)
- Thus, we need a synchronization mechanism
Parallelization is Hard
- Traditionally, concurrency is difficult to reason about (from uniprocessors up to small-scale architectures)
- Concurrency is even more difficult to reason about
  - At the scale of datacenters (even across datacenters)
  - In the presence of failures
  - In terms of multiple interacting services
- Not to mention debugging…
- The reality:
  - Write your own dedicated library, then program with it
  - Burden on the programmer to explicitly manage everything
Solution: MapReduce
- Programming model for expressing distributed computations at a massive scale
- Hides system-level details from the developers
  - No more race conditions, lock contention, etc.
- Separates the what from the how
  - Developer specifies the computation that needs to be performed
  - Execution framework ("runtime") handles actual execution
What is MapReduce Used For?
- At Google:
  - Index building for Google Search
  - Article clustering for Google News
  - Statistical machine translation
- At Yahoo!:
  - Index building for Yahoo! Search
  - Spam detection for Yahoo! Mail
- At Facebook:
  - Data mining
  - Ad optimization
  - Spam detection
Typical MapReduce Execution
- Iterate over a large number of records
- Extract something of interest from each        (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results                 (Reduce)
- Generate final output
Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)
MapReduce Basics
- Programmers specify two functions:
  - map (k, v) -> <k', v'>*
  - reduce (k', v') -> <k', v'>*
  - All values with the same key are sent to the same reducer
- The execution framework handles everything else…
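To make that contract concrete, here is a minimal single-machine sketch in Python. It is illustrative only: run_mapreduce and its signature are our own invention, not part of any MapReduce API. It shows the three things the framework does for you: apply map to every input record, group intermediate values by key (the shuffle), and hand each key with its list of values to reduce.

from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: apply the user's mapper to every (key, value) input record.
    intermediate = []
    for k, v in records:
        intermediate.extend(mapper(k, v))

    # Shuffle phase: group all intermediate values by key; this grouping is
    # what guarantees that all values with the same key reach the same reducer.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: hand each key and its list of values to the user's reducer.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output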
Warm-Up Example: Word Count
- We have a large file of words, one word per line
- Count the number of times each distinct word appears in the file
- Sample application: analyze web server logs to find popular URLs
Word Count (2)
- Case 1: Entire file fits in memory
- Case 2: File too large for memory, but all <word, count> pairs fit in memory
- Case 3: File on disk, too many distinct words to fit in memory
  sort datafile | uniq -c
Word Count (3)
- To make it slightly harder, suppose we have a large corpus of documents
- Count the number of times each distinct word occurs in the corpus
  words(docs/*) | sort | uniq -c
  where words takes a file and outputs the words in it, one per line
- The above captures the essence of MapReduce
- The great thing is that it is naturally parallelizable
Word Count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
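For a runnable counterpart to the pseudocode above, the sketch below plugs word count into the hypothetical run_mapreduce simulator from the earlier sketch; wc_map, wc_reduce, and the tiny two-document corpus are our own illustrative choices, and the tokenization is deliberately simplistic.

import re

def wc_map(doc_name, text):
    # Emit (word, 1) for every word in the document body.
    for word in re.findall(r"\w+", text.lower()):
        yield (word, 1)

def wc_reduce(word, counts):
    # Sum the partial counts collected for one word.
    yield (word, sum(counts))

docs = [("d1", "see bob run"), ("d2", "see spot throw")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# -> [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]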
Word Count Illustration

map(key=url, val=contents):
  for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
  sum all "1"s in the values list
  emit result "(word, sum)"

Input documents: "see bob run" and "see spot throw"
Map output: (see, 1), (bob, 1), (run, 1), (see, 1), (spot, 1), (throw, 1)
Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
Implementation Overview
- 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory each
- Limited bandwidth
- Storage is on local IDE disks
- GFS: distributed file system manages the data (SOSP '03)
- Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
- The implementation at Google is a C++ library linked into user programs
Distributed Execution Overview
[Figure, adapted from (Dean and Ghemawat, OSDI 2004): (1) the user program submits the job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read their input splits (split 0 … split 4); (4) map output is written as intermediate files on local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the final output files (output file 0, output file 1)]
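The step the figure glosses over is how intermediate pairs find their reduce task. The toy sketch below illustrates the usual approach, hash partitioning (hash(key) mod R); the number of reduce tasks R and the sample map output are our own example values.

import zlib

R = 4  # number of reduce tasks -- our choice for this example

def partition(key, num_reducers=R):
    # crc32 gives a hash that is deterministic across runs, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % num_reducers

map_output = [("see", 1), ("bob", 1), ("run", 1),
              ("see", 1), ("spot", 1), ("throw", 1)]

# Each map task writes its intermediate pairs into R buckets, one per reduce task.
buckets = {r: [] for r in range(R)}
for k, v in map_output:
    buckets[partition(k)].append((k, v))

# Each reduce task r later fetches bucket r from every map task's local disk,
# sorts it by key, and applies the user's reduce function to each key group.
for r, pairs in sorted(buckets.items()):
    print(r, sorted(pairs))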
MapReduce Implementations
- Google has a proprietary implementation in C++
  - Bindings in Java, Python
- Hadoop is an open-source implementation in Java
  - Development led by Yahoo!, used in production
  - Now an Apache project
  - Rapidly expanding software ecosystem
- Lots of custom research implementations
  - For GPUs, Cell processors, etc.
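Hadoop's streaming mode runs mapper and reducer executables that read lines on stdin and write key<TAB>value lines on stdout, so the word-count job can also be expressed as two small scripts. The sketch below is one such pair; the file names mapper.py and reducer.py are our own, and the scripts assume the reducer sees its input sorted by key (which the framework, or a local sort, provides).

#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sum the counts for each word; input arrives sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The same pair can be tested locally, without a cluster, with a shell pipeline such as: cat docs.txt | python3 mapper.py | sort | python3 reducer.py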
Bonus Assignment: Write a MapReduce version of Assignment no. 2
MapReduce in VisionerBOT
VisionerBOT Distributed Design
PART II: Google File System
Distributed File System
- Don't move data to workers… move workers to the data!
  - Store data on the local disks of nodes in the cluster
  - Start up the workers on the node that has the data local
- Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow, but disk throughput is reasonable
- A distributed file system is the answer
  - GFS (Google File System) for Google's MapReduce
  - HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
- Commodity hardware over "exotic" hardware
  - Scale "out", not "up"
- High component failure rates
  - Inexpensive commodity components fail all the time
- "Modest" number of huge files
  - Multi-gigabyte files are common, if not encouraged
- Files are write-once, mostly appended to
  - Perhaps concurrently
- Large streaming reads over random access
  - High sustained throughput over low latency
(GFS slides adapted from material by Ghemawat et al., SOSP 2003)
GFS: Design Decisions
- Files stored as chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access and keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large datasets, streaming reads
- Simplify the API
  - Push some of the issues onto the client (e.g., data layout)
- HDFS = GFS clone (same basic ideas)
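To make the metadata-only role of the master tangible, here is a toy sketch of the design above. It is our own simplification, not GFS's actual interfaces or data structures: a master that tracks file-to-chunk and chunk-to-replica mappings with 64 MB chunks and 3-way replication, while the file data itself would live only on the chunkservers.

import itertools
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the GFS paper
REPLICATION = 3                 # each chunk lives on 3+ chunkservers

class ToyMaster:
    """Toy GFS-like master: keeps only metadata, never file data."""

    def __init__(self, chunkservers):
        self.chunkservers = list(chunkservers)
        self.files = {}            # file name -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> chunkservers holding replicas
        self._handles = itertools.count()

    def create_file(self, name, size_bytes):
        n_chunks = max(1, (size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE)
        handles = []
        for _ in range(n_chunks):
            handle = next(self._handles)
            # Place each replica on a distinct chunkserver.
            replicas = random.sample(self.chunkservers, REPLICATION)
            self.chunk_locations[handle] = replicas
            handles.append(handle)
        self.files[name] = handles

    def lookup(self, name, offset):
        """Client asks: which chunk holds this byte offset, and on which servers?"""
        handle = self.files[name][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

master = ToyMaster(chunkservers=[f"cs{i}" for i in range(10)])
master.create_file("crawl/part-0000", size_bytes=5 * 10**9)   # ~5 GB file
print(master.lookup("crawl/part-0000", offset=200 * 1024 * 1024))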
QUESTIONS?

Editor's Notes

#7: In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have "processing nodes" and "storage nodes" linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data are processed sequentially. A simple scenario poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1-terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access. The development of solid-state drives is unlikely to change this balance, for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.

MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce's design explicitly trade latency for throughput.
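The month-versus-workday comparison in the note is easy to reproduce with a back-of-the-envelope script. The 10 ms seek time and 100 MB/s sustained transfer rate below are our own assumptions (roughly in line with commodity disks of that era), not figures taken from the note.

# Back-of-the-envelope: update 1% of 10^10 100-byte records (a 1 TB database).
SEEK_S = 0.010          # assumed: ~10 ms per random seek
THROUGHPUT = 100e6      # assumed: ~100 MB/s sustained sequential transfer
RECORDS = 10**10
RECORD_BYTES = 100
UPDATED = RECORDS // 100

# Random access: one seek to read and one seek to write back each updated record.
random_update_s = UPDATED * 2 * SEEK_S

# Sequential: stream the whole 1 TB in, rewrite the whole 1 TB out.
db_bytes = RECORDS * RECORD_BYTES
sequential_rewrite_s = 2 * db_bytes / THROUGHPUT

print(f"random updates : {random_update_s / 86400:6.1f} days")    # ~23 days
print(f"full rewrite   : {sequential_rewrite_s / 3600:6.1f} hours")  # ~5.6 hours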