Data Science: A Personal History
    Jeff Hammerbacher




1
Data Scientist




2
Data Applications Scientist

    “I have only heard back from one person about that
    ‘Data Applications Scientist’ thing. I had anticipated
    more discussion” – me, February 29, 2008




3
“I guess I’m arguing for ‘Data’ to replace ‘Research’ in
    those titles (I am happy to drop the ‘Applications’) as
    the primary focus of our organization is not corporate
    research.” – me, March 1, 2008




4
Data Scientist

    “I’d like to avoid specialization at this early stage and I
    expect every member of our group to have a mix of
    research, engineering, and analysis in their workload.”
    – me, March 1, 2008




5
Facebook Data Team

    The Facebook Data Team built scalable platforms for
    the collection, management, and analysis of data.

    We used these platforms to drive informed decisions in
    areas critical to the success of the company and to
    build data-intensive products and services.




6
Data Science




7
Introduction to Data Science

    1.   Data Preparation
    2.   Data Presentation
    3.   Experimentation
    4.   Observation
    5.   Data Products




8
Data Scientist-Computer Symbiosis




9
Philosophy

     •   Instrument everything
     •   Put all of your data in one place
     •   Data first, questions later
     •   Store first, structure later
     •   Keep raw data forever
     •   Let everyone party on the data
     •   Produce tools to support the whole research cycle
     •   Modular and composable infrastructure
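
One way to read "Store first, structure later" and "Keep raw data forever" is as schema-on-read: raw records are appended untouched, and a structure is chosen only when a question is asked. The sketch below is a minimal illustration under that assumption. `RAW_LOG`, `read_with_schema`, and the field names are hypothetical and do not come from the talk.

```python
import json

# Hypothetical raw event log: lines are appended as-is and never rewritten.
RAW_LOG = [
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "view"}',
    'not json at all',
    '{"user": "a", "action": "click", "ms": 80, "page": "home"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a structure at read time; the raw lines are left untouched."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines stay in the raw store for a later pass
        if not isinstance(record, dict):
            continue
        # Project onto the fields this particular question needs.
        yield {name: record.get(name) for name in fields}


# The question arrives after the data: which events carry a latency value?
clicks = [r for r in read_with_schema(RAW_LOG, ["user", "ms"]) if r["ms"] is not None]
print(clicks)
```

Because the projection happens at read time, a later question can ask for `page` or any other field without reloading or migrating the stored data.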


10
CDH

     •   Storage
          •   Append-only unstructured data
          •   Append-only tabular data
          •   Mutable tabular data




11
CDH

     •   Compute
          •   Resource management
          •   Parallel frameworks
          •   High-level interfaces
          •   Libraries




12
CDH

     •   Integration
          •   File system API
          •   Database API
          •   Batch data import/export
          •   Event data import
          •   User interface




13
Cloudera Products

     •   Subscription
          •   Proprietary software
          •   Support
     •   Training and Certification
     •   Services




14
Cloudera Deployment




     [Diagram: Application Database, CDH, and Data Warehouse in the top row; Business Intelligence and Analytics below]




15
Cloudera Workloads (Batch)

     •   Active archive
     •   Data reservoir
     •   ETL/ELT offload




16
Cloudera Workloads (Interactive)

     •   Application data delivery




17
Cloudera Customer Survey

     •   67% use Hive
     •   54% use HBase
     •   51% load data every 90 minutes or less
     •   71% move data from Hadoop to RDBMS for
         interactive SQL
     •   62% would like to consolidate into single platform




18
Cloudera Impala

     •   General-purpose SQL query engine
          •   Should work both for analytic and transactional workloads
          •   Will support queries that take from microseconds to hours




19
Cloudera Impala

     •   Runs directly within Hadoop
          •   Reads widely used Hadoop file formats
          •   Talks to widely used Hadoop storage managers
          •   Runs on same nodes that run Hadoop processes




20
Cloudera Impala

     •   High performance
          •   C++ instead of Java
          •   Runtime code generation
          •   Completely new execution engine—not MapReduce
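
The "Runtime code generation" bullet can be illustrated in general terms. The sketch below is not Impala's implementation (the slide says the engine is written in C++). It only contrasts walking an expression tree for every row with generating a specialized function once per query. The expression format and the helper names `interpret`, `to_source`, and `generate` are assumptions made for this example.

```python
# Expression tree for the filter (price > 10) and (qty < 5).
# Each node is a tuple: (operator, left, right).
EXPR = ("and", (">", "price", 10), ("<", "qty", 5))


def interpret(expr, row):
    """Walk the tree for every row: flexible, but pays dispatch cost per row."""
    op, left, right = expr
    if op == "and":
        return interpret(left, row) and interpret(right, row)
    if op == ">":
        return row[left] > right
    if op == "<":
        return row[left] < right
    raise ValueError(f"unknown operator: {op}")


def to_source(expr):
    """Translate the tree into a Python expression string."""
    op, left, right = expr
    if op == "and":
        return f"({to_source(left)} and {to_source(right)})"
    if op in (">", "<"):
        return f"(row[{left!r}] {op} {right!r})"
    raise ValueError(f"unknown operator: {op}")


def generate(expr):
    """Compile the tree once into a function specialized for this query."""
    source = f"def predicate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["predicate"]


ROWS = [
    {"price": 12, "qty": 3},
    {"price": 8, "qty": 1},
    {"price": 20, "qty": 9},
]

predicate = generate(EXPR)
matches = [row for row in ROWS if predicate(row)]
print(matches)
```

The generated function does the tree walk once, at query setup, so the per-row work is a single call with the comparisons inlined.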




21
Cloudera Impala

     •   Validated Beta Partners
          •   MicroStrategy
          •   QlikView
          •   Tableau
          •   Pentaho
          •   Karmasphere
          •   Capgemini




22
New Cloudera Workloads (Interactive)

     •   Operational reporting
     •   Ad hoc query




23
Cloudera Deployment




     [Diagram: Application Database, CDH, and Data Warehouse in the top row; Business Intelligence and Analytics below]




24
The Future




25
Potential Future Workloads

     •   Search
     •   MPI
     •   Stream processing
     •   Graph computations
     •   Linear algebra
     •   Optimization
     •   Simulation



26
The Last Mile

     •   Data libraries
     •   Language
     •   Libraries
     •   IDE for Data Scientists

     •   Mixed-initiative
     •   Memory
     •   Collaboration
     •   Model and analysis path selection
27
Doing Data Science

     •   More data sources
     •   More rows
     •   More columns (novel or derived)
     •   Better data quality
     •   Better outcomes
     •   Better loss functions
     •   Causal inference in observational studies
     •   Effect size estimates
     •   Meta-analysis
     •   Model lifecycle
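
The "Effect size estimates" and "Meta-analysis" bullets can be made concrete with two standard textbook formulas: Cohen's d with a pooled standard deviation, and fixed-effect inverse-variance pooling of per-study estimates. This is a sketch only. The slide does not say which estimators the speaker has in mind, and the function names and sample numbers below are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def fixed_effect_pool(estimates, std_errors):
    """Combine study estimates, weighting each by 1 / (standard error)^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical measurements from one experiment.
d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])

# Hypothetical effect estimates and standard errors from two studies.
pooled, pooled_se = fixed_effect_pool([0.2, 0.4], [0.1, 0.1])
print(d, pooled, pooled_se)
```

A fixed-effect model assumes every study estimates the same underlying effect. When studies differ in population or design, a random-effects model is the usual alternative.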
28
29


Data Science Day New York: Data Science: A Personal History
