Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Improving MySQL Performance with Hadoop
Sagar Jauhari, Manish Kumar
India: May 03 – May 04, 2012
San Francisco: September 30 – October 4, 2012




Program Agenda

●   Introduction
●   Inside Hadoop!
●   Integration with MySQL
●   Facebook's usage of MySQL & Hadoop
●   Twitter's usage of MySQL & Hadoop




Introduction
MySQL
   ●          12 million product installations
   ●          65,000 downloads each day
   ●          Part of the rapidly growing open source LAMP stack
   ●          MySQL Commercial Editions Available




Introduction
Hadoop
   ●          Highly scalable Distributed Framework
                 ○          Yahoo! has a 4000 node cluster!
   ●          Extremely powerful in terms of computation
                 ○          Sorts a TB of random integers in 62 seconds!




Introduction
Hadoop is ...
   ●          A scalable system for data storage and processing.
   ●          Fault tolerant
   ●          Parallelizes data processing across many nodes
   ●          Leverages its distributed file system (HDFS) to
              cheaply and reliably replicate chunks of data.




Introduction
Who uses Hadoop?
   ●          Yahoo: ad systems and web search.
   ●          Facebook: reporting/analytics and machine learning.
   ●          Twitter: data warehousing and data analysis.
   ●          Netflix: movie recommendation algorithm uses Hive (which uses
              Hadoop, HDFS and MapReduce underneath).


Introduction
MySQL Vs Hadoop

                   MySQL                         Hadoop
Data capacity      TB+ (may require sharding)    PB+
Data per query     GB?                           PB+
Read/write         Random read/write             Sequential scans, append-only
Query language     SQL                           Java MapReduce, scripting languages, HiveQL
Transactions       Yes                           No
Indexes            Yes                           No
Latency            Sub-second (hopefully)        Minutes to hours
Data structure     Structured                    Structured or unstructured

Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010

Inside Hadoop


                                                                       A shallow Deep Dive


Inside Hadoop
HDFS
    ●         A distributed, scalable, and portable file system
              written in Java
    ●         A Hadoop instance typically has a single name node;
              a cluster of data nodes forms the HDFS cluster.

[Diagram: Name Node, HDFS, Map / Reduce Workers]



Inside Hadoop
HDFS
    ●         Uses the TCP/IP layer for communication
    ●         Stores large files across multiple machines
    ●         Single name node stores metadata in-memory.

[Diagram: Name Node, HDFS, Map / Reduce Workers]
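
The two HDFS points above (files split across machines, metadata held in memory by one name node) can be pictured with a toy model. This is only a sketch under stated assumptions: `place_blocks`, the tiny block size, the replication factor of 3 and the round-robin placement are all hypothetical choices for illustration, and real HDFS placement policy is more involved.

```python
BLOCK_SIZE = 4      # bytes per block in this toy; real HDFS blocks are far larger
REPLICATION = 3     # assumed replication factor for the sketch


def place_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [data nodes holding a replica]} for one file."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Round-robin (an assumption): replica r of block b goes to node (b + r) mod N.
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


# The "name node" here is just an in-memory dict: file path -> block placement.
namenode = {}
nodes = ["dn1", "dn2", "dn3", "dn4"]
namenode["/logs/day1"] = place_blocks(10, nodes)

for block, replicas in namenode["/logs/day1"].items():
    print(block, replicas)
```

A 10-byte file with 4-byte blocks yields three blocks, each stored on three different data nodes, while the only place that knows the layout is the in-memory dictionary.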



Inside Hadoop
Map Reduce
    ●         Design Goals
                  ○         Scalability
                  ○         Cost Efficiency
    ●         Implementation
                  ○         User Jobs are executed as 'map' and 'reduce' functions
                  ○         Work distribution and fault tolerance are managed by the framework


            Input → Map → Shuffle and sort → Reduce → Output




Inside Hadoop
Map Reduce
    ●         Map
                  ○         A MapReduce job splits the input data into independent chunks
                  ○         The chunks are processed by map tasks in parallel
                  ○         Generic key-value computation




            Input → Map → Shuffle and sort → Reduce → Output
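
As a rough illustration of the map step described on this slide, the sketch below splits a small input into independent chunks and runs a map task on each chunk concurrently. The helper names (`split_input`, `map_task`, `run_map_phase`) are hypothetical, and a local thread pool stands in for the cluster nodes; it is not how Hadoop itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(records, chunk_size):
    """Split the input into independent, fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_task(chunk):
    """Generic key-value computation: emit (word, 1) for every record."""
    return [(word, 1) for word in chunk]


def run_map_phase(records, chunk_size=2, workers=3):
    chunks = split_input(records, chunk_size)
    # Threads are an assumption standing in for map workers on separate nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(map_task, chunks))  # pool.map keeps chunk order


words = ["Hadoop", "MySQL", "Hive", "Sqoop", "Pig", "Hadoop"]
map_outputs = run_map_phase(words)
for output in map_outputs:
    print(output)
```

With six input words and a chunk size of two, three map tasks run, each producing its own list of intermediate (key, value) pairs.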




Inside Hadoop
Map Reduce
    ●         Reduce
                  ○         Map output from the data nodes is merge-sorted so that the
                            key-value pairs for a given key are contiguous
                  ○         The merged data is read sequentially and the values are
                            passed to the reduce method with an iterator reading the
                            input file until the next key is encountered



            Input → Map → Shuffle and sort → Reduce → Output
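
The reduce-side mechanics described above (merge sorted runs so equal keys are contiguous, then hand each key's values to reduce through an iterator) can be sketched as follows. The function names are hypothetical, and in-memory lists stand in for the sorted files that a real job would read from disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_runs(map_outputs):
    """Sort each map output by key, then lazily merge the sorted runs."""
    runs = [sorted(output, key=itemgetter(0)) for output in map_outputs]
    return heapq.merge(*runs, key=itemgetter(0))


def reduce_phase(merged_pairs, reduce_fn):
    """Read the merged stream once, calling reduce_fn for each run of equal keys."""
    results = []
    for key, group in groupby(merged_pairs, key=itemgetter(0)):
        values = (value for _, value in group)  # iterator that stops at the next key
        results.append((key, reduce_fn(key, values)))
    return results


def count_reducer(key, values):
    return sum(values)


map_outputs = [
    [("Hadoop", 1), ("MySQL", 1)],
    [("Hive", 1), ("Sqoop", 1)],
    [("Pig", 1), ("Hadoop", 1)],
]
totals = reduce_phase(merge_sorted_runs(map_outputs), count_reducer)
print(totals)
```

Because the merged stream is sorted by key, the reducer never needs the whole data set in memory; it only consumes values until the key changes.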




Inside Hadoop
Map Reduce
          Input → Map → Shuffle and sort → Reduce → Output

     Input (Word)        Output (Word, Count)
     Hadoop              Hadoop     2
     MySQL               MySQL      1
     Hive                Hive       1
     Sqoop               Sqoop      1
     Pig                 Pig        1
     Hadoop

[Diagram: the six input words pass through three Map tasks and two Reduce tasks.]


Inside Hadoop
How does Hadoop use MapReduce?
    ●         Framework consists of a single master JobTracker
              and one slave TaskTracker per cluster-node.
    ●         Master
                  ○         Schedules the jobs' component tasks on the slaves
                  ○         Monitors the jobs
                  ○         Re-executes the failed tasks
    ●         Slave
                  ○         Executes the tasks as directed by the master.
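
The master's responsibilities listed above (schedule tasks on slaves, monitor them, re-execute failures) can be illustrated with a deliberately simplified, sequential sketch. `run_job`, `flaky_execute` and the tracker names are hypothetical; a real JobTracker works asynchronously and considers data locality, which this toy ignores.

```python
def run_job(tasks, workers, execute, max_attempts=3):
    """Assign each task to a worker; if it fails, retry it on the next worker."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            worker = workers[(i + attempt) % len(workers)]
            try:
                results[task] = (worker, execute(worker, task))
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # "monitoring": note the failure and reschedule
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results


def flaky_execute(worker, task):
    """Stand-in for a slave executing a task; one worker is assumed to be down."""
    if worker == "tracker-2":
        raise RuntimeError("worker down")
    return task.upper()


outcome = run_job(
    ["t0", "t1", "t2"],
    ["tracker-1", "tracker-2", "tracker-3"],
    flaky_execute,
)
print(outcome)
```

Task `t1` is first sent to the failing worker and then completes on another one, which is the behaviour the caller never has to handle explicitly.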



Inside Hadoop
Why MapReduce?
    ●         Language support
                  ○         Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R).
    ●         Scales horizontally
    ●         Programmer is isolated from individual failed tasks
                  ○         Tasks are restarted on another node




Inside Hadoop
Map Reduce Limitations
    ●         Not a good fit for problems that exhibit task-driven
              parallelism.
    ●         Requires a particular form of input - a set of (key,
              value) pairs.
    ●         A lot of MapReduce applications end up sharing data
              one way or another.



Integration with MySQL

                                                                           Leveraging Hadoop to Improve MySQL Performance


Integration with MySQL

●     The benefits of MySQL to developers are the speed,
      reliability, data integrity and scalability it provides.
●     It can successfully process large amounts of data (in
      petabytes).
●     But applications that require massively parallel
      processing may need the benefits of a parallel
      processing system, such as Hadoop.



Integration with MySQL




Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010



Integration with MySQL
Problem Statement: Word Count Problem
 ● In a large set of documents, find the number of occurrences of each word.




Integration with MySQL
Word count problem
          Input → Map → Shuffle and sort → Reduce → Output

     Input (Word)        Output (Word, Count)
     Hadoop              Hadoop     2
     MySQL               MySQL      1
     Hive                Hive       1
     Sqoop               Sqoop      1
     Pig                 Pig        1
     Hadoop

[Diagram: the six input words pass through three Map tasks and two Reduce tasks.]


Integration with MySQL
Mapping

Map(key, value)
    foreach (word in the value)
        output(word, 1)

Key and value represent a row of data:
    the key is the byte offset, the value is a line.

Intermediate output:
    <word1>, 1
    <word2>, 1
    <word3>, 1
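
The Map pseudocode on this slide translates almost directly into a runnable sketch. The names `read_records` and `word_count_map` are hypothetical; `read_records` merely imitates a line-oriented input reader that supplies the byte offset as the key, and is not a Hadoop API.

```python
def read_records(text):
    """Yield (byte_offset, line) records, one per input line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))  # advance by the line's size in bytes


def word_count_map(key, value):
    """Map(key, value): emit (word, 1) for each word in the line; the key is unused."""
    for word in value.split():
        yield word, 1


text = "Hadoop MySQL\nHive Hadoop\n"
intermediate = [
    pair
    for offset, line in read_records(text)
    for pair in word_count_map(offset, line)
]
print(intermediate)
```

Note that the mapper ignores the byte-offset key entirely; for word count only the line content matters.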

Integration with MySQL
Reducing

Reduce(key, list)
    sum the list
    output(key, sum)

Hadoop aggregates the keys and calls reduce for each unique key:
    <word1>, (1,1,1,1,1,1…1)
    <word2>, (1,1,1)
    <word3>, (1,1,1,1,1,1)

Final result:
    <word1>, 45823
    <word2>, 1204
    <word3>, 2693
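
To make the `(key, list)` shape of the reduce input concrete, here is a small sketch that first builds the per-key lists and then sums them. `group_by_key` and `word_count_reduce` are hypothetical names, and a dictionary replaces the framework's shuffle purely for illustration.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Aggregate intermediate pairs into {key: [values]}, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def word_count_reduce(key, values):
    """Reduce(key, list): sum the list and output (key, sum)."""
    return key, sum(values)


pairs = [("Hadoop", 1), ("MySQL", 1), ("Hive", 1), ("Hadoop", 1)]
grouped = group_by_key(pairs)
final = dict(word_count_reduce(key, values) for key, values in sorted(grouped.items()))
print(dict(grouped))
print(final)
```

The first print shows the intermediate lists of ones per word, and the second shows the final counts after the reduce call for each unique key.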



Integration with MySQL

                                                                                       Demo




Integration with MySQL
Video




Facebook's usage of MySQL & Hadoop

● Facebook collects terabytes of data every day from around 800 million
  users.
● MySQL handles pretty much every user interaction: likes,
  shares, status updates, alerts, requests, etc.
● Hadoop/Hive Warehouse
  – 4800 cores, 2 PetaBytes (July 2009)
  – 4800 cores, 12 PetaBytes (Sept 2009)
● Hadoop Archival Store
  – 200 TB



Facebook's usage of MySQL & Hadoop
Hive
    ●         Data warehouse system for Hadoop.
    ●         Facilitates easy data summarization.
    ●         Hive translates HiveQL to MapReduce code.
    ●         Querying
                  ○         Provides a mechanism to project structure onto this data
                  ○         Allows querying the data using a SQL-like language called HiveQL




Facebook's usage of MySQL & Hadoop




Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hive Vs SQL

                         RDBMS                            Hive
Language                 SQL-92 standard (maybe)          Subset of SQL-92 plus Hive-specific extensions
Update capabilities      INSERT, UPDATE and DELETE        INSERT but not UPDATE or DELETE
Transactions             Yes                              No
Latency                  Sub-second                       Minutes or more
Indexes                  Any number of indexes, very      No indexes, data is always
                         important for performance        scanned (in parallel)
Data size                TBs                              PBs
Data per query           GBs                              PBs

Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hadoop Implementation
At Twitter
    ●         > 12 terabytes of new data per day!
    ●         Most stored data is LZO compressed
    ●         Uses Scribe to write logs to Hadoop
                  ○         Scribe: a log collection framework created and
                            open-sourced by Facebook.
    ●         Hadoop is used for data warehousing and data analysis.




References

    ●         Leveraging Hadoop to Augment MySQL Deployments - Sarah
              Sproehnle, Cloudera, 2010
    ●         http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
    ●         http://semanticvoid.com
    ●         http://michael-noll.com
    ●         http://hadoop.apache.org/




Legal Disclaimer

    ●         All other products, company names, brand names,
              trademarks and logos are the property of their
              respective owners.




Thank You


