Why Hadoop for data science?

Ofer Mendelevitch
PASS BA Conference, April 2013




© Hortonworks Inc. 2013
A brief history of Apache Hadoop

Timeline, 2004 to 2013. Milestones shown along the axis: Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop (2013).

• 2005: Yahoo! creates team under E14 to work on Hadoop
          – Focus on INNOVATION
• 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
          – Focus on OPERATIONS
• 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
          – Focus on STABILITY
Core Hadoop: HDFS & MapReduce

Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store

• MapReduce: distributed computation framework that
  handles the complexities of distributed programming
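
To make the MapReduce model concrete, here is a minimal single-process sketch of the classic word count, written in plain Python rather than against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of Hadoop; on a real cluster the framework performs the shuffle and runs mappers and reducers on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The programmer writes only the mapper and the reducer; distribution, grouping and retry on failure are the parts the framework is said to handle.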




Keys to Hadoop’s power
• Computation co-located with data
          – Data and computation system co-designed and co-developed to work together


• Process data in parallel across thousands of
  “commodity” hardware nodes
          – Self-healing; failure handled by software


• Designed for write-once, read-many access
          – There are no random writes
          – Optimized for minimal seeks on hard drives
HDP: Enterprise-Ready Hadoop

Hortonworks Data Platform (HDP) components:
• Operational services (manage & operate at scale): Ambari, Oozie
• Data services (store, process and access data): Flume, Sqoop, Hive, Pig, HBase, HCatalog
• Hadoop core (distributed storage & processing): MapReduce, HDFS
• Platform services (enterprise readiness): HA, DR, snapshots, security, …
• Deployment options: OS / VM, cloud, appliance
What is a data product?


“A software system whose core
functionality depends on the
application of statistical analysis
and machine learning to data.”


Example 1: Google AdWords

Example 2: People you may know

Example 3: Spell correction
What is data science?

#1: Extracting deep meaning from data
(data mining; finding “gems” in data)




Common data science tasks

• Descriptive
          – Clustering: detect natural groupings
          – Outlier detection: detect anomalies
          – Affinity analysis: co-occurrence patterns

• Predictive
          – Classification: predict a category
          – Regression: predict a value
          – Recommendation: predict a preference
What is data science?

#2: Building data products
(Delivering gems on a regular basis)

[Diagram: Pre-process → Build model → SQL, with the labels “Periodic batch processing” and “Online serving”]
Why Hadoop for data science?




Reason #1:
Explore full datasets



Explore large datasets directly with Hadoop

[Diagram: a researcher laptop (R, Matlab, SAS, etc.) runs an analysis cycle of Acquire, Clean data, Visualize/grok, Model and Measure/evaluate against the full dataset stored on Hadoop]
Integrate Hadoop in your data analysis flow

• Exploratory data analysis on the full dataset
         – Simple statistics: mean, median, quantile, etc.
         – Pre-processing: grep, regex, etc.


• Ad-hoc sampling / filtering
         – Random: with or without replacement
         – Sample by unique key
         – K-fold cross-validation
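
One common way to sample by unique key and to assign cross-validation folds in a distributed job is to hash the key, so that every node makes the same decision for the same key without coordination. The sketch below assumes records are dictionaries and uses a hypothetical `user_id` field; the helper names are illustrative and are not taken from the slides.

```python
import hashlib


def bucket(key, num_buckets, salt=""):
    """Map a key to a stable bucket in [0, num_buckets) using an MD5 hash."""
    digest = hashlib.md5((salt + str(key)).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def sample_by_key(records, key_field, percent):
    """Keep all records whose key hashes into the first `percent` of 100 buckets.

    Records sharing a key are kept or dropped together.
    """
    return [r for r in records if bucket(r[key_field], 100) < percent]


def assign_fold(record, key_field, k):
    """Assign a record to one of k cross-validation folds based on its key."""
    return bucket(record[key_field], k, salt="fold:")


if __name__ == "__main__":
    # Hypothetical event log: 500 events spread over 50 users.
    events = [{"user_id": i % 50, "event": i} for i in range(500)]
    sample = sample_by_key(events, "user_id", 20)
    print(len(sample), "events kept from",
          len({r["user_id"] for r in sample}), "users")
    print("fold of first event:", assign_fold(events[0], "user_id", 5))
```

Because the decision depends only on the key, the same filter can run inside any mapper, and a user's records never end up split between the training and test folds.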




Why Hadoop for data science?




Reason #2:
Mine larger datasets



More data -> better outcomes




[Figures from Banko & Brill, 2001 and Halevy, Norvig & Pereira, 2009]
Learning algorithms with large datasets…

Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…


Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
         – Recommendation: Alternating Least Squares (ALS)
         – Clustering: K-means
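
As an illustration of how a learning algorithm can be restructured for this setting, the sketch below expresses one k-means iteration as a map step and a reduce step. It runs in a single Python process, with a list of lists standing in for data partitions held on different nodes; the function names and the toy data are assumptions made for the example, not code from the talk.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Map step: for one partition, emit centroid index -> (coordinate sums, count)."""
    partial = {}
    for point in points:
        idx = nearest(point, centroids)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce step: merge the partial sums and recompute every centroid."""
    totals = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            t_sums, t_count = totals.get(idx, ([0.0] * len(sums), 0))
            totals[idx] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
    updated = []
    for i, old in enumerate(centroids):
        if i in totals:
            sums, count = totals[i]
            updated.append(tuple(s / count for s in sums))
        else:
            updated.append(tuple(old))  # keep a centroid that attracted no points
    return updated


def kmeans(partitions, centroids, iterations=10):
    """Repeat the map and reduce steps for a fixed number of iterations."""
    for _ in range(iterations):
        partials = [map_partition(part, centroids) for part in partitions]
        centroids = reduce_partials(partials, centroids)
    return centroids


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)]))
```

Only the small per-centroid sums and counts travel between the map and reduce steps, so the full dataset never has to fit in one machine's memory.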


Why Hadoop for data science?




Reason #3:
Large-scale data preparation



80% of data science work is data preparation




Raw data → processed data, through steps such as:
• Sampling, filtering
• Joins
• Entity resolution
• Strip away HTML/PDF/DOC/PPT
• Document vector generation
• Term normalization
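
Three of these preparation steps (stripping markup, term normalization and document vector generation) can be sketched for a single document as follows. This is a minimal illustration using only the Python standard library, with deliberately simple normalization (lowercasing and splitting on non-alphanumeric characters); the function names are assumptions, and in a Hadoop job a function like `document_vector` would typically be applied per record inside a mapper.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML string (script/style text is not filtered)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(html):
    """Return a sparse term-frequency vector as a term -> count mapping."""
    return Counter(normalize_terms(strip_html(html)))


if __name__ == "__main__":
    doc = "<p>Hadoop stores <b>data</b>. HADOOP processes data!</p>"
    print(document_vector(doc))
```

Real pipelines usually add stemming, stop-word removal and weighting such as TF-IDF; those are left out here to keep the sketch short.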


Hadoop is ideal for batch data preparation and
cleanup of large datasets




Why Hadoop for data science?




Reason #4:
Accelerate data-driven innovation



Barriers to speed with traditional data architectures

• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation

Timeline:
• Start: “I need new data”
• Start to 6 months: schema change project
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”
“Schema on read” means faster time-to-innovation

• Hadoop uses “schema on read”
• Low barrier for data-driven innovation

Timeline:
• Start: “I need new data”. “Let’s just put it in a folder on HDFS”
• 3 months: “Let me see… is it any good?”
• 6 months: “My model is awesome!”
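
The idea behind “schema on read” can be sketched in a few lines: raw records are stored untouched, and each reader applies whatever schema it needs when it reads them. The example below is a hedged illustration in plain Python; it assumes JSON-lines input, and the field names, `SCHEMA_V1`, `SCHEMA_V2` and `read_with_schema` are all hypothetical.

```python
import json

# Raw lines as they might sit in a folder on HDFS (hypothetical sample data).
RAW_LINES = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340, "referrer": "ad"}',
    'not valid json',
]

# A schema maps a field name to (converter, default value).
SCHEMA_V1 = {"user": (str, None), "ms": (int, 0)}
# A later analysis needs one more field; stored data does not change.
SCHEMA_V2 = {**SCHEMA_V1, "referrer": (str, "direct")}


def read_with_schema(lines, schema):
    """Parse raw lines and project them onto the given schema at read time."""
    for line in lines:
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole read
        if not isinstance(raw, dict):
            continue  # skip JSON values that are not objects
        yield {
            name: convert(raw[name]) if name in raw else default
            for name, (convert, default) in schema.items()
        }


if __name__ == "__main__":
    print(list(read_with_schema(RAW_LINES, SCHEMA_V1)))
    print(list(read_with_schema(RAW_LINES, SCHEMA_V2)))
```

Adding the `referrer` field required only a new schema dictionary on the reading side, in contrast to a schema-on-write system, where the table definition and load process would have to change first.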


Summary



                          Why use Hadoop for data science?
                          1. Data exploration with full datasets
                          2. Mine larger datasets
                          3. Pre-processing at scale
                          4. Faster data-driven cycles




Quick start: Hortonworks Sandbox
• What it is
    – A free download of a virtualized single-node implementation of the enterprise-ready
      Hortonworks Data Platform
    – A personal Hadoop environment
    – An integrated learning environment with frequently and easily updatable, hands-on,
      step-by-step tutorials
• What it does
    – Dramatically accelerates the process of learning Apache Hadoop
    – Accelerates and validates the use of Hadoop within your unique data architecture
    – Lets you use your own data to explore and investigate your use cases
• ZERO to big data in 15 minutes

Download Hortonworks Sandbox: www.hortonworks.com/sandbox
Sign up for training for in-depth learning: hortonworks.com/hadoop-training/




Thank you!

                          Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend, @hortonworks

Come visit us @ Booth S5
We’re hiring!




Editor's Notes

  • #3: Add 2007: formed first engineering team focused on this? Want to make point that 5 years of experience. Add: timing that Cloudera bailed. Started as Nutch project at Yahoo, became Hadoop.
  • #4: At its core, Hadoop is about HDFS and MapReduce, 2 projects that are really about distributed storage and data processing, which are the underpinnings of Hadoop. In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc., which enable use of the technology for a much broader (and mission-critical) problem set. This is accomplished not by introducing new open source projects, but rather by ensuring that these aspects are addressed within existing projects.