MapReduce with Apache Hadoop
Analysing Big Data


April 2010
Gavin Heavyside
gavin.heavyside@journeydynamics.com
About Journey Dynamics
    •   Founded in 2006 to develop software technology to address the issues of
        congestion, fuel efficiency, driving safety and eco-driving
    •   Based in the Surrey Technology Centre, Guildford, UK
    •   Analyse large amounts (TB) of GPS data from cars, vans & trucks
    •   TrafficSpeedsEQ® - Accurate traffic speed forecasts by hour of day and day of week
        for every link in the road network
    •   MyDrive® - Unique & sophisticated system that learns how drivers behave
         • Drivers can improve fuel economy
         • Insurance companies can understand driver risk
         • Navigation devices can improve route choice & ETA
         • Fleet managers can monitor their fleet to improve safety & eco-driving




                                                                          © 2010 Journey Dynamics Ltd
Big Data

    • Data volumes increasing
    • NYSE: 1TB new trade data/day
    • Google: Processes 20PB/day (Sep 2007) http://tcrn.ch/agYjEL
    • LHC: 15PB data/year
    • Facebook: several TB photos uploaded/day




“Medium” Data
    • Most of us aren't at Google or Facebook scale
    • But: data at the GB/TB scale is becoming more common
    • Outgrow conventional databases
    • Disks are cheap, but slow




        • 1TB drive - £50
        • About 2.8 hours (10,000 seconds) to read 1TB at 100MB/s
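
A back-of-envelope check of the read-time bullet, as a minimal sketch. It assumes decimal units (1TB = 10^12 bytes, 100MB/s = 10^8 bytes/s) and a perfectly even spread of data across drives. The function name and the 100-disk figure are illustrative, not from the talk. A single drive comes out at roughly 2.8 hours.

```python
def read_time_hours(size_bytes: float, bytes_per_second: float, disks: int = 1) -> float:
    """Hours needed to read size_bytes when it is spread evenly over
    `disks` drives that are all read in parallel."""
    seconds = size_bytes / (bytes_per_second * disks)
    return seconds / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

single = read_time_hours(1 * TB, 100 * MB)             # one drive
spread = read_time_hours(1 * TB, 100 * MB, disks=100)  # same data over 100 drives

print(f"1 disk:    {single:.2f} hours")
print(f"100 disks: {spread * 60:.1f} minutes")
```

Reading the same terabyte from many disks at once takes minutes rather than hours, which is the case for spreading data over a cluster.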




Two Challenges

    • Managing lots of data

    • Doing something useful with it




Managing Lots of Data

    • Access and analyse any or all of your data
    • SAN technologies (FC, iSCSI, NFS)
    • Querying (MySQL, PostgreSQL, Oracle)



    ➡ Cost, network bandwidth, concurrent access, resilience

    ➡ When you have 1000s of nodes, MTBF < 1 day
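
A rough sketch of why cluster-wide MTBF drops below a day. It assumes nodes fail independently at a constant rate, so failure rates add up and the cluster MTBF is the per-node MTBF divided by the node count. The 3-year per-node figure is an illustrative assumption, not a measurement.

```python
def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Mean time between failures anywhere in the cluster, in days.

    Assumes independent nodes with a constant failure rate, so the
    cluster MTBF is the node MTBF divided by the number of nodes.
    """
    return node_mtbf_years * 365 / nodes


for n in (1, 100, 1000, 5000):
    days = cluster_mtbf_days(3, n)  # 3-year node MTBF is an assumed figure
    print(f"{n:>5} nodes: one failure every {days:.2f} days")
```

Under these assumptions a couple of thousand nodes is enough to expect a hardware failure more than once a day, so the system has to treat failure as routine.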




Analysing Lots of Data
    • Parallel processing
    • HPC
    • Grid Computing
    • MPI
    • Sharding

    ➡ Too big for memory, specialised HW, complex, scalability
    ➡ Hardware reliability in large clusters




Apache Hadoop
    • Reliable, scalable distributed computing platform

    • HDFS - high throughput fault-tolerant distributed file system
    • MapReduce - fault-tolerant distributed processing

    • Runs on commodity hardware
    • Cost-effective

    • Open source (Apache License)




Hadoop History
    • 2003-2004 Google publishes MapReduce & GFS papers
    • 2004 Doug Cutting adds DFS & MapReduce to Nutch
    • 2006 Cutting joins Yahoo!, Hadoop moves out of Nutch
    • Jan 2008 - top level Apache project
    • April 2010: 95 companies on PoweredBy Hadoop wiki
    • Yahoo!, Twitter, Facebook, Microsoft, New York Times,
    LinkedIn, Last.fm, IBM, Baidu, Adobe
            "The name my kid gave a stuffed yellow elephant.
            Short, relatively easy to spell and pronounce,
            meaningless, and not used elsewhere: those are my
            naming criteria. Kids are good at generating such.
            Googol is a kid's term"
            Doug Cutting


Hadoop Ecosystem
     • HDFS
     • MapReduce

     • HBase
     • ZooKeeper
     • Pig
     • Hive

     • Chukwa
     • Avro




Anatomy of a Hadoop Cluster


     [Diagram: a Namenode and a JobTracker alongside racks 1 to n; each rack holds several worker nodes, and each worker runs a Tasktracker and a Datanode]




HDFS
     • Reliable shared storage
     • Modelled after GFS
     • Very large files
     • Streaming data access
     • Commodity Hardware
     • Replication
     • Tolerate regular hardware failure




HDFS
     • Block size 64MB
     • Default replication factor = 3


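     [Diagram: a file of five blocks (1-5) is written to HDFS and spread over five nodes holding blocks 1,2,3 / 2,3,4 / 3,4,5 / 4,5,1 / 5,1,2, so every block is stored three times]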
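
The block and replica arithmetic can be sketched as follows. `block_count` and `place_replicas` are hypothetical helper names, and the round-robin placement is a toy that only reproduces the five-node layout of the diagram. Real HDFS placement is rack-aware and does not work this way.

```python
import math

BLOCK_SIZE = 64 * 1024**2  # 64MB block size, as on the slide
REPLICATION = 3            # default replication factor, as on the slide


def block_count(file_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file of file_bytes."""
    return math.ceil(file_bytes / block_size)


def place_replicas(num_blocks: int, num_nodes: int, replication: int = REPLICATION):
    """Toy placement: each block goes to `replication` neighbouring nodes,
    wrapping around. Illustration only; not the real HDFS policy."""
    layout = {node: [] for node in range(1, num_nodes + 1)}
    for block in range(1, num_blocks + 1):
        for r in range(replication):
            node = (block - 1 - r) % num_nodes + 1
            layout[node].append(block)
    return layout


file_bytes = 300 * 1024**2           # an assumed 300MB file
blocks = block_count(file_bytes)     # blocks needed at 64MB each
layout = place_replicas(blocks, num_nodes=5)

print(f"{blocks} blocks, {blocks * REPLICATION} block replicas in total")
for node, held in layout.items():
    print(f"node {node}: blocks {sorted(held)}")
```

With three copies of every block, losing any single node (or any two nodes in this toy layout) still leaves at least one replica of each block readable.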




MapReduce
     • Based on 2004 Google paper
     • Concepts from Functional Programming
     • Used for lots of things within Google (and now everywhere)
     • Parallel Map => Shuffle & Sort => Parallel Reduce
     • Easy to understand and write MapReduce programs
     • Move the computation to the data
     • Rack-aware
     • Linear Scalability
     • Works with HDFS, S3, KFS, file:// and more




MapReduce
     • “Single Threaded” MapReduce:
             cat input/* | map | sort | reduce > output

     • Map program parses the input and emits [key,value] pairs
     • Sort by key
     • Reduce computes output from values with same key

     [Diagram: Map => Sort => Reduce]




     • Extrapolate to PB of data on thousands of nodes
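
The single-threaded pipeline can be sketched in one Python process. Word count is used purely as an illustrative problem, and `map_words` and `reduce_counts` are made-up names standing in for the `map` and `reduce` programs in the shell pipeline.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map: parse each input line and emit (key, value) pairs, here (word, 1)."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce: compute one output per key from all values sharing that key.
    Relies on the input already being sorted by key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


lines = ["the quick brown fox", "the lazy dog", "the fox"]

pairs = sorted(map_words(lines), key=itemgetter(0))  # the "sort" stage
result = dict(reduce_counts(pairs))
print(result)
```

The reducer never needs the whole data set in memory: because the pairs arrive sorted, it only has to hold the values for one key at a time.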


MapReduce
     • Distributed Example
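     [Diagram: each input split (Split 0 to Split n, read from HDFS) feeds a Map task whose output is sorted; the sorted map output is copied to the Reduce tasks and merged, and each Reduce writes one output part (part 0, part 1) back to HDFS]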
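
A small in-process simulation of the distributed flow, as a sketch only. Each "map task" partitions and sorts its own output, and each "reduce task" merges the sorted runs copied from every map task. The function names are hypothetical, and `zlib.crc32` is a stand-in for whatever hash the framework uses to assign keys to reducers.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key: str, num_reducers: int) -> int:
    """Choose the reducer for a key. crc32 is a deterministic stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split, num_reducers):
    """Map one split, bucket the output per reducer, and sort each bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for line in split:
        for word in line.split():
            buckets[partition(word, num_reducers)].append((word, 1))
    return [sorted(bucket) for bucket in buckets]


def run_reduce_task(sorted_runs):
    """Merge the sorted runs copied from every map task, then sum per key."""
    merged = heapq.merge(*sorted_runs)
    return {key: sum(value for _, value in group)
            for key, group in groupby(merged, key=itemgetter(0))}


splits = [["a b a"], ["b c"], ["a c c"]]  # three tiny input splits
num_reducers = 2

map_outputs = [run_map_task(split, num_reducers) for split in splits]
parts = [run_reduce_task([out[r] for out in map_outputs])
         for r in range(num_reducers)]
print(parts)
```

Because every occurrence of a key hashes to the same reducer, each key ends up in exactly one output part, and the parts can be written independently.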




MapReduce can be good for:
     • “Embarrassingly Parallel” problems
     • Semi-structured or unstructured data
     • Index generation
     • Log analysis
     • Statistical analysis of patterns in data
     • Image processing
     • Generating map tiles
     • Data Mining
     • Much, much more
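
Index generation from the list above is a natural fit. A minimal sketch, with made-up document ids and text: the map step emits (word, document) pairs and the reduce step collects the documents for each word into an inverted index. The `defaultdict` grouping stands in for the shuffle and sort that the framework would perform.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (word, doc_id) once for every distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce: build the sorted list of documents that contain the word."""
    return word, sorted(set(doc_ids))


docs = {
    "d1": "hadoop stores data",
    "d2": "hadoop processes data",
    "d3": "pig runs on hadoop",
}

# Stand-in for shuffle & sort: group mapped values by key.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_doc(doc_id, text):
        grouped[word].append(d)

index = dict(reduce_postings(word, ids) for word, ids in grouped.items())
print(index["hadoop"])
print(index["data"])
```

Each document is mapped independently of the others, which is what makes the problem easy to parallelise.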




MapReduce is not good for:
     • Real-time or low-latency queries
     • Some graph algorithms
     • Algorithms that can't be split into independent chunks
     • Some types of joins*
     • Not a replacement for RDBMS




     * Can be tricky to write unless you use an abstraction e.g. Pig, Hive




Writing MapReduce Programs
     • Java
     • Pipes (C++, sockets)
     • Streaming
     • Frameworks, e.g. Wukong (Ruby), Dumbo (Python)
     • JVM languages e.g. JRuby, Clojure, Scala
     • Cascading.org
     • Cascalog
     • Pig
     • Hive




Streaming Example (ruby)
     • mapper.rb




     • reducer.rb
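
The speaker notes (#21) describe the example as a map-side hash join. The sketch below shows that technique in Python rather than Ruby and is not the code from the slide. It assumes tab-separated records and uses made-up vehicle and speed data; `load_lookup` and `join_mapper` are hypothetical names.

```python
def load_lookup(lines):
    """Build the in-memory hash table from the small data set.
    Each line is assumed to be 'id<TAB>name'."""
    table = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        table[key] = value
    return table


def join_mapper(records, lookup):
    """For each record 'id<TAB>payload' of the large data set, emit the joined
    line if the id is in the lookup table (inner join; others are dropped)."""
    for line in records:
        key, payload = line.rstrip("\n").split("\t", 1)
        if key in lookup:
            yield f"{key}\t{lookup[key]}\t{payload}"


small = ["v1\tvan", "v2\ttruck"]                             # fits in memory
big = ["v1\t52mph", "v3\t40mph", "v2\t61mph", "v1\t48mph"]   # streamed record by record

joined = list(join_mapper(big, load_lookup(small)))
for line in joined:
    print(line)
```

In a streaming job the large data set would be read line by line from stdin and each joined line written to stdout. Because the join is finished inside the mapper, no shuffle is needed for it, but the small side has to fit in each mapper's memory.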




Pig
     • High level language for writing data analysis programs
     • Runs MapReduce jobs
     • Joins, grouping, filtering, sorting, statistical functions
     • User-defined functions
     • Optional schemas
     • Sampling
     • Pig Latin resembles an imperative language: you define the steps to run




Pig Example




Hive
     • Data warehousing and querying
     • HiveQL - SQL-like language for querying data
     • Runs MapReduce jobs
     • Joins, grouping, filtering, sorting, statistical functions
     • Partitioning of data
     • User-defined functions
     • Sampling
     • Declarative syntax




Hive Example




Getting Started
     • http://hadoop.apache.org
     • Cloudera Distribution (VM, source, rpm, deb)
     • Elastic MapReduce

     • Cloudera VM
     • Pseudo-distributed cluster




Learn More
     • http://hadoop.apache.org
     • Books




     • Mailing Lists
     • Commercial Support & Training, e.g. Cloudera




Related
     • Cassandra 0.6 has Hadoop integration - run MapReduce
     jobs against data in Cassandra
     • NoSQL DBs with MapReduce functionality include
     CouchDB, MongoDB, Riak and more
     • RDBMS with MapReduce include Aster, Greenplum,
     HadoopDB and more




Gavin Heavyside
     gavin.heavyside@journeydynamics.com
          www.journeydynamics.com




Introduction to Hadoop - ACCU2010


Editor's Notes

  • #4: Andrei Alexandrescu: 1300ish photos/second at Facebook.
  • #7: Use only a subset/sample of the data. A server might have an MTBF of a few years; with thousands of them you can expect failures at least daily. The system needs to handle HW failures.
  • #9: Explain what we mean by commodity hardware: 8-core, 16-24GB RAM, 4x1TB disks, gigabit Ethernet.
  • #10: 95 companies is not exhaustive!
  • #11: ZooKeeper: distributed synchronisation service (locking, race conditions); used by HBase, a distributed column-oriented database. Chukwa: large-scale log collection & analysis tools. Avro: like protocol buffers; moving towards being the default IPC mechanism in Hadoop.
  • #12: Mention the secondary namenode. The namenode is a point of failure: not highly available. Mention lots of disks as JBOD: don't stripe, don't RAID. Rack-aware: splits will be processed local to the HDFS blocks where possible. A pseudo-distributed cluster can run all services on a single machine. DN/TT on lots of machines; NN, SNN, JT depend on cluster size.
  • #13: Optimised for large (GB) files. The 64MB block size is not efficient for lots of small files -> concatenate them.
  • #15: Text data. Binary data (protocol buffers, Avro). Custom input/output formats.
  • #17: Compare with the previous example: splits on each node, shuffle to reduce nodes. Talk about partitioning. Reducers have to wait until the map and sort/shuffle have finished. Combiners can run on the Tasktracker. Can be a single Reducer or many.
  • #18: Mention speculative execution. Forward index: list of words per document. Inverted index: list of documents containing a word. Once you “get” it, you start to see how lots of problems can be parallelised. Map tiles at Google: Tech Talk on YouTube.
  • #19: Does the algorithm have a strict data dependency between iterations? Then it might not be efficient to parallelise with MapReduce.
  • #20: JVM languages can access the Hadoop Java classes. Streaming works in any language that can read/write stdin/stdout. Cascalog is brand new; only heard about it last night.
  • #21: Implements a map-side hash join.
  • #22: SAMPLE command samples random parts of the data. SPLIT. AVG, COUNT, CONCAT, MIN, MAX, SUM. Custom input/output formats.
  • #24: Hive was developed at Facebook and contributed back to the community. SQL is a declarative language: you specify an outcome, and the query planner generates a sequence of steps. Ad-hoc queries using basic SQL. Custom input/output formats.
  • #25: Create Table. Load Data. Select/Join/Group.
  • #27: Cloudera do developer training and sysadmin training. Cloudera do commercial support. Other providers are listed on the Hadoop homepage.
  • #28: Cassandra is an alternative to HBase; seems to have more momentum behind it. Plenty of hobbyist/small-scale MapReduce frameworks out there.