Cloud Architectures - Jinesh Varia - GrepTheWeb
On Cloud Computing…

  “We in academia and the government labs have not kept up with the times. Universities really need to get on board.”
    - Randal E. Bryant, Dean of the Computer Science School at Carnegie Mellon University

  Source: http://www.nytimes.com/2007/10/08/technology/08cloud.html
What is Amazon?




Amazon.com and AWS

  [Chart: bandwidth consumed by Amazon Web Services compared with bandwidth consumed by Amazon’s global websites, by year]
AWS Customer Momentum (490,000)

  [Bar chart: customer count at Q1 2006, Q1 2007, Q1 2008, and Q4 2008; horizontal axis from 0 to 600]
Amazon S3 Momentum

  Total Objects Stored in Amazon S3
    Q2 2006:    800,000,000
    Q2 2007:  5,000,000,000
    Q3 2007: 10,000,000,000
    Q4 2008: 40,000,000,000
Why Are People So Excited ?
Most Companies Worry About This

  Your Idea -> Undifferentiated “Heavy Lifting” -> Successful Product

  Undifferentiated “Heavy Lifting”:
    Power/Cooling
    Hardware Management
    Bandwidth Management
    Contract Negotiations
    Maintenance
    Deployment
    Purchasing Decisions
    Load Balancing/Scaling
    Managing Growth
70/30 Switch
Focus on Innovation

  Your Idea -> Undifferentiated “Heavy Lifting” -> Successful Product

  Undifferentiated “Heavy Lifting”: Cloud Computing
Amazon Cloud Computing

 Elastic Unlimited Capacity      Get Big Fast

      Pay As You Go           Spend Cash Wisely

   Simple, Reliable, Fast     Focus On Your Idea
  Amazon EC2
  Amazon SQS
  Amazon S3
  Amazon SimpleDB
  Amazon EC2-EBS
ANIMOTO.COM
Scale: 50 servers to 5000 servers in 3 days

  [Chart: number of EC2 instances per day, 4/12/2008 to 4/20/2008]
    Steady state of ~40 instances
    Launch of Facebook modification
    Peak of 5000 instances
    Amazon EC2 easily scaled to handle additional traffic
“TimesMachine” from NY Times

  1851-1922 Articles
  TIFF -> PDF
  Input: 11 Million Articles (4 TB of data)
  What did he do?
    100 EC2 Instances for 24 hours
    All data on S3
    Output: 1.5 TB of Data
    Hadoop, iText, JetS3t
CS290F : Scalable Internet Services

  UCSB Fall 2006
  Professor created an app to manage team usage
  Ruby on Rails
  Complete stack: from load balancer and app server to DB
  Learn how to scale: simulated load
  Generated graphs
  All course contents, student assignments, and lessons learned are on the Wiki
CS345a : Data Mining @ Stanford

  Tools used:
    Shell/Linux/Java
    Hadoop on EC2
    Data set on S3
    Datasets: Netflix, Alexa, IR datasets from TREC

  Class organization:
    Stanford Winter 2007
    30-35 Students
    Each team spawns 10-15 Hadoop slave nodes
    TA created Getting-Started AMIs (& scripts)
    TA managed the students’ usage
Bioinformatics @ Northwestern University

  • Using Hadoop to perform sequence alignments on large genomic datasets
    – Northwestern University (Flatow & Lin) presented a talk at the Next-gen Sequencing Data Analysis meeting
       • “An understanding of the industrial strength map-reduce paradigm will be invaluable to those looking to cope with the next-generation datasets. Combined with the power of elastic computing clouds, many of the potential barriers to dealing with such large-scale data can be completely eliminated.”
Cloud Architectures

  [Chart: hardware infrastructure/cost over time, with job execution time marked]

Shrink your processing time

  [Chart: CPUs over time]
Main Problems

  Technical (Hadoop, Web Services):
    • How to co-ordinate jobs between machines (distributed processing)?
    • What if a machine fails?
    • How will I scale out?

  Business (Cloud Computing):
    • How do I get management signoff?
    • Resources to manage the infrastructure?
    • How do I get rid of the idle infrastructure?
GrepTheWeb
What’s so cool about GrepTheWeb?

  [Diagram: RegEx and WWW]
Examples of Patterns
    Source Code
      int x = 40 + i
    Anything with punctuation
      “Hey!” he said, “Are you ok?”
    Case Sensitive
      Function CallOrderController()
    Equations
      f(x) = x^2
    Other Patterns
      (dis)integration of life, Email Address
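
The pattern categories above can be tried out as regular expressions. The following is a minimal sketch, assuming Python's `re` module; the deck does not say which regex dialect GrepTheWeb accepts, and the pattern strings and the `matching_categories` helper are illustrative assumptions rather than anything taken from the service.

```python
import re

# Hypothetical patterns, one per category on the slide (not from GrepTheWeb itself).
PATTERNS = {
    "source_code": r"int\s+x\s*=\s*40\s*\+\s*i",
    "punctuation": r'"Hey!" he said',
    "case_sensitive": r"Function CallOrderController\(\)",
    "equation": r"f\(x\)\s*=\s*x\^2",
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
}


def matching_categories(text):
    """Return the names of all patterns that occur somewhere in text."""
    return [name for name, pat in PATTERNS.items() if re.search(pat, text)]


print(matching_categories("int x = 40 + i"))
print(matching_categories("contact: someone@example.com"))
# Matching is case sensitive by default, so a lower-cased variant finds nothing.
print(matching_categories("function callordercontroller()"))
```

Metacharacters such as `(`, `)`, `+`, and `^` have to be escaped when they are meant literally, which is why source code and equations are interesting test cases for a regex search.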
Zoom Level 1

  [Diagram: GrepTheWeb Service]
    Inputs: RegEx; input dataset (list of document URLs) from Alexa
    GetStatus
    Output: subset of document URLs that matched the RegEx
Zoom Level 2

  [Architecture diagram]
    StartGrep (RegEx)
    Amazon SQS
    Controller: manage phases (Launch, Monitor, Shutdown)
    Amazon SimpleDB: user info, job status info
    GetStatus
    Amazon EC2 Cluster
    Amazon S3: Input, Output
    Input Files (Alexa Crawl)
    Get Output

  Amazon SQS
    Distributed Transient Buffer
    Never Lose a message
    Ideal for small short-lived messages
    Access control
    Message Locking

  Amazon S3
    Infinitely Scalable Storage in the cloud
    Highly Available, Durable and Reliable
    Private and Public Storage
    Pay by the GB

  Amazon EC2
    Resizable Computing Capacity in the cloud
    Spawn Server Instances using a Web Service call
    Root Level Access
    Pay by the hour

  Amazon SimpleDB
    Database in the cloud
    Lightweight Query-able Attribute Store
    Distributed and Partitioned
    Pay by GB, Pay per Query
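Zoom Level 3

  [Architecture diagram]
    Amazon SQS: Launch Queue, Monitor Queue, Shutdown Queue, Billing Queue
    Controller: Launch Controller, Monitor Controller, Shutdown Controller, Billing Controller
    StartGrep -> Launch Queue -> Launch Controller -> Monitor Queue -> Monitor Controller -> Shutdown Queue -> Shutdown Controller -> Billing Queue -> Billing Controller -> Billing Service
    Launch Controller: launch; insert JobID, Status; insert EC2 info
    Monitor Controller: ping; get EC2 info
    Shutdown Controller: check for results; shutdown
    Hadoop Cluster on Amazon EC2: Master M, Slaves N, HDFS
    Amazon SimpleDB: Status DB (read by GetStatus)
    Amazon S3: Input (Input Files from the Alexa Crawl; Get File), Output (Put File; Get Output)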
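
The queue-and-controller pipeline in this diagram can be sketched in a few lines. This is a minimal sketch under stated assumptions: in-memory `queue.Queue` objects stand in for the Amazon SQS queues, a plain dict stands in for the SimpleDB status table, and the `start_grep` and `run_controller` names are hypothetical. The real controllers also launch, ping, and shut down EC2 instances, which is omitted here.

```python
from queue import Queue

# In-memory stand-ins for the four Amazon SQS queues on the slide.
PHASES = ["launch", "monitor", "shutdown", "billing"]
queues = {phase: Queue() for phase in PHASES}

# Stand-in for the SimpleDB status table: job_id -> status string.
status_db = {}


def start_grep(job_id, regex):
    """StartGrep: record the job and enqueue it for the launch phase."""
    status_db[job_id] = "queued"
    queues["launch"].put({"job_id": job_id, "regex": regex})


def run_controller(phase):
    """Drain one phase queue, update job status, hand off to the next phase."""
    q = queues[phase]
    while not q.empty():
        msg = q.get()
        # A real controller would do the phase's work here (assumed, not shown).
        status_db[msg["job_id"]] = phase + " done"
        next_index = PHASES.index(phase) + 1
        if next_index < len(PHASES):
            queues[PHASES[next_index]].put(msg)


start_grep("job-1", r"f\(x\)")
for phase in PHASES:
    run_controller(phase)
print(status_db)
```

Because each controller only reads from its own queue and writes to the next one, the phases stay decoupled: a controller can fail and restart without the others needing to know, as long as the message is still in its queue.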
Zoom Level 4

  [Diagram: the Service runs one Hadoop job per user request]
    User1: StartJob1 -> Hadoop Job (Map tasks -> Combine -> Reduce) -> StopJob1
    User2: StartJob2 -> Hadoop Job (Map tasks -> Combine -> Reduce) -> StopJob2
    Store status and results
    Get Result
SideTrack: WordCount Example

  MAPPER: For each input record, extract a set of key/value pairs that we care about in each record
    “Hi Hadoop, Bye Hadoop”
    (“Hi”, 1), (“Hadoop”, 1), (“Bye”, 1), (“Hadoop”, 1)

  REDUCER: For each extracted key/value pair, combine it with other values that share the same key
    (“Hadoop”, [1,1])
    (“Hadoop”, 2)

  [Diagram: Input -> input key/value pairs -> Map -> (key 1, values..), (key 3, values..) -> Aggregate -> (key 1, all values..) -> Reduce -> (final key 1, values..)]

  Source: Doug Cutting’s Slide Deck on Hadoop
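
The WordCount example can be reproduced without a cluster. The sketch below is plain Python rather than the Hadoop API: the `mapper`, `shuffle`, and `reducer` functions are illustrative names, and `shuffle` stands in for the grouping step that the framework performs between map and reduce.

```python
import re
from collections import defaultdict


def mapper(record):
    """Emit a (word, 1) pair for every word in the input record."""
    for word in re.findall(r"\w+", record):
        yield (word, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Combine all values that share a key into a single count."""
    return (key, sum(values))


pairs = list(mapper("Hi Hadoop, Bye Hadoop"))
grouped = shuffle(pairs)
counts = dict(reducer(key, values) for key, values in grouped.items())
print(counts)
```

For the slide's input, the grouped value for `"Hadoop"` is `[1, 1]` and the reducer turns it into `("Hadoop", 2)`, matching the example above.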
Zoom Level 5 (Hadoop MapReduce)

  MAPPER: For each input record, extract a set of key/value pairs that we care about in each record
    Input: (LineNumber, s3pointer)
    Output: (s3pointer, [matches])

  REDUCER: For each extracted key/value pair, combine it with other values that share the same key
    Identity Function

  [Diagram: Input -> input key/value pairs -> Map -> (key 1, values..), (key 3, values..) -> Aggregate -> (key 1, all values..) -> Reduce -> (final key 1, values..)]

  Source: Doug Cutting’s Slide Deck on Hadoop
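
The grep job on this slide can be sketched the same way as WordCount. This is a minimal sketch, not the GrepTheWeb implementation: it is plain Python rather than the Hadoop API, the `FAKE_S3` dict and its `s3://bucket/...` pointers are hypothetical stand-ins for documents stored in Amazon S3, and emitting a pair only for documents with at least one match is an assumption based on the service returning the subset of documents that matched.

```python
import re

# Stand-in for Amazon S3: maps a hypothetical pointer to document text.
FAKE_S3 = {
    "s3://bucket/doc1": "f(x) = x^2 and g(x) = x + 1",
    "s3://bucket/doc2": "no equations here",
}


def mapper(line_number, s3pointer, regex):
    """Fetch the document behind s3pointer; emit (s3pointer, matches) if any."""
    text = FAKE_S3[s3pointer]
    matches = re.findall(regex, text)
    if matches:
        yield (s3pointer, matches)


def reducer(key, values):
    """Identity function: pass each mapped pair through unchanged."""
    return (key, values)


# Each input record is one line of a manifest: (LineNumber, s3pointer).
manifest = ["s3://bucket/doc1", "s3://bucket/doc2"]
results = []
for line_number, pointer in enumerate(manifest):
    for key, matches in mapper(line_number, pointer, r"\w\(x\)"):
        results.append(reducer(key, matches))
print(results)
```

All of the work happens in the map step, so the job parallelizes across as many map tasks as there are input records; the identity reducer only collects the output.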
