Hadoop
          Data Analytics in the Cloud

          Mike Olson
          Chief Executive Officer




Friday, July 17, 2009
Hadoop History

          ▪   Doug Cutting worked on Nutch (web-scale crawler-based
              search), 2002-2004
          ▪   Google published MapReduce paper in 2004
              ▪   Cutting added DFS & MapReduce support to Nutch
              ▪   Joined by Mike Cafarella
          ▪   2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
              ▪   Web-scale deployments in 2007, 2008 at Y!, Facebook, others
          ▪   Today: 22 committers to core project
              ▪   Related projects: HBase, Hive, Pig, Mahout, Hama and others


Why Hadoop?

          ▪   Large web properties invented MapReduce for large-scale,
              reliable, inexpensive analytics
          ▪   Enterprises generally need these techniques
              ▪   Retail, financial services, oil and gas, health care, green
                  technologies and more
          ▪   Hardware trends driving toward long-term retention of valuable
              source data
              ▪   New analytical tools are required
          ▪   Hadoop complements current-generation data warehousing and
              analytical products


Where Does Data Come From?
          Many Sources Provide Deeper Insight

          ▪   Simulations and Scientific/Experimental Data
              ▪   genome sequencing, medical imaging, wireless sensors
          ▪   Existing Databases
              ▪   product catalogs, historical sales data, transaction histories
          ▪   User Data
              ▪   web logs, website clicks, pictures, videos, bulletin boards (BBS), etc.
          ▪   System Generated Data
              ▪   Thousands of systems reporting status every second
          ▪   Data Comes in All Shapes, Sizes, Schemas and Structures
              ▪   Hadoop combines many sources regardless of format and structure


Hadoop Technical Overview: HDFS
          Storing Data: Distributed Over Many Machines




                                   Commodity Servers

                          Files are broken into blocks and distributed across all
                          servers. Replication protects data from hardware failure.


                          HDFS:       Hadoop Distributed File System
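
The block-and-replica idea on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the server names, the tiny block size, and the round-robin placement are all assumptions made for illustration, and real HDFS placement is more involved than this.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct servers (hypothetical round-robin policy)."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def survives_failure(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """True if every block still has at least one replica on a server that did not fail."""
    return all(
        any(server not in failed for server in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), ["s0", "s1", "s2", "s3"], replication=3)
    print(placement)
    print(survives_failure(placement, failed={"s1"}))
```

With three replicas per block, losing any single server leaves at least two copies of each block, which is the sense in which replication protects data from hardware failure.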




Hadoop Technical Overview: MapReduce
          Processing Data: Leveraging Data Locality




                           Data elements processed locally, in parallel
                        Reliable computation implicitly managed by Hadoop


                                        MapReduce
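
A minimal in-process sketch of the map, shuffle, and reduce steps, using word count as the example job. It does not use the Hadoop API: the function names are hypothetical, and a thread pool stands in for the cluster, whereas Hadoop schedules map tasks on the servers that already hold the data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key across every map output."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(splits: list[str]) -> dict[str, int]:
    """Run map over each split in parallel, then shuffle and reduce."""
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, splits))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Each split is mapped independently of the others, which is what lets the work run in parallel next to the data.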

Hadoop Technical Overview: Reliability
          Fault Tolerance: Handled with Software




                   Data loss prevented through automatic replication and rebalancing
                        Computation is restarted automatically without user intervention


                                     Software Fault Tolerance
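
The automatic restart of computation can be illustrated with a small retry loop. This is a sketch under stated assumptions, not Hadoop's scheduler: `run_with_restart`, the `max_attempts` value, and the use of `RuntimeError` to stand for a node failure are all hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_restart(task: Callable[[int], T], max_attempts: int = 4) -> tuple[T, int]:
    """Run a task, restarting it after a simulated failure, up to max_attempts times.

    Returns the task result together with the attempt number that succeeded.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except RuntimeError as err:  # stands in for a lost node or a crashed task
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(attempt: int) -> str:
    """Hypothetical task that fails on its first two attempts, then succeeds."""
    if attempt < 3:
        raise RuntimeError("simulated node failure")
    return "done"


if __name__ == "__main__":
    print(run_with_restart(flaky_task))
```

The caller submits the task once and only sees the final result, which mirrors the slide's point that restarts need no user intervention.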

Cloud Deployment Options for Hadoop
          ▪   In your data center
              •   Acquire, provision, administer servers
              •   Choose a virtualization infrastructure?
          ▪   On dedicated, hosted services
              •   Scale up or down by coordinating with your MSP
          ▪   On dynamic web services (AWS and others)
              •   Spin up, use, shut down a cluster
          ▪   Issues:
              •   Data persistence and location, organizational control


(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.




Rhat OSS - Cloudera - Mike Olson - Hadoop Data Analytics In The Cloud
