Machine Learning and Hadoop
Present and Future

Josh Wills
Cloudera Data Science Team
September 6th, 2012
  
About Me

Copyright 2012 Cloudera Inc. All rights reserved
  
Outline

•  Part 1: Industrial Machine Learning
•  Part 2: ML and Hadoop: The State of the World
•  Part 3: ML and Hadoop: Where Things are Headed
  
(Academic) ML vs. (Academic) Statistics

“Machine learning is statistics minus any checking of models and assumptions.”
    -- Brian Ripley, UseR! 2004 (provocatively paraphrased)
  
Industrial Machine Learning: Truth #1

The thing that we are trying to predict is rarely the thing that we are trying to optimize.
  
Industrial Machine Learning: Truth #2

Systems precede algorithms.
  
Industrial Machine Learning: Truth #3

Practice Over Theory Blog
  
Implication

Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.
  
Outline

•  Part 1: Industrial Machine Learning
•  Part 2: ML and Hadoop: The State of the World
•  Part 3: ML and Hadoop: Where Things are Headed
  
“Hadoop. It’s Where The Data Is.”
  
Hadoop Platform: Substrate

•  Commodity servers
•  Open source operating system
•  Open source Configuration Management
•  Open source Coordination Service
•  Open source File System API
•  Open source Efficient and Extensible File Formats
•  Open source Efficient and Extensible RPC Libraries
  
Hadoop Platform: MapReduce Frameworks

•  Languages/Environments
    •  Pig Latin (Apache)
    •  HiveQL (Apache)
    •  Jaql (IBM)
•  Java/Scala APIs
    •  Crunch (Apache Incubator)
    •  Scoobi (NICTA)
    •  Cascading (Concurrent)
    •  Pangool
  
ML and Hadoop: The State of the World
  
MapReduce

•  Great for:
    •  Data Preparation
    •  Feature Engineering
    •  Model Validation/Evaluation
•  Works Well For Certain Model Fitting Problems
    •  Collaborative Filtering Algorithms
    •  Expectation Maximization
    •  Decision Trees (PLANET; Gradient Boosted Decision Trees)
•  Not A Practical Option for Many Kinds of Problems
•  Way More Detail in the KDD 2011 Talk
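As a rough illustration of why data preparation and feature engineering fit MapReduce well, here is a minimal single-process sketch: a map step emits key/value pairs per record, a shuffle groups values by key, and a reduce step aggregates each group into a feature. The record layout (`user`, `item`, `clicked`) and all function names are hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit (user, (1, clicked)) for each event record."""
    user, _item, clicked = record
    yield user, (1, int(clicked))

def reduce_user(user, values):
    """Reduce step: aggregate per-user counts into a click-rate feature."""
    views = sum(v for v, _ in values)
    clicks = sum(c for _, c in values)
    return user, {"views": views, "clicks": clicks, "ctr": clicks / views}

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

# Hypothetical click log: (user, item, clicked)
events = [("u1", "a", True), ("u1", "b", False), ("u2", "a", False), ("u1", "c", True)]
features = run_mapreduce(events, map_record, reduce_user)
```

Each key is reduced independently of the others, which is what lets the same logic spread across many machines in a real job.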
  
Apache Mahout

•  The starting place for MapReduce-based machine learning algorithms
    •  Not machine-learning-in-a-box
    •  Custom tweaks/modifications are the rule
•  A disparate collection of algorithms for:
    •  Recommendations
    •  Clustering
    •  Classification
    •  Frequent Itemset Mining
  
Apache Mahout (cont.)

•  Best Library: Taste Recommender
    •  Oldest project, most widely-deployed in production
    •  SVD implementation is particularly active

•  Good Libraries: Online SGD
    •  Does not use MapReduce
    •  Vowpal Wabbit is faster, has L-BFGS option

•  Roll Your Own Instead: Naïve Bayes
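One reason rolling your own Naïve Bayes is realistic is that training amounts to counting, which is the kind of work MapReduce handles well. The sketch below is a single-process illustration, assuming a multinomial model with add-one (Laplace) smoothing over whitespace-separated tokens. The function names and toy documents are hypothetical, and this is not Mahout's implementation.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Count class frequencies and per-class token frequencies."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for token in text.split():
            token_counts[label][token] += 1
            vocab.add(token)
    return class_counts, token_counts, vocab

def predict(model, text):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in text.split():
            # Add-one smoothing keeps unseen tokens from zeroing the score.
            count = token_counts[label][token] + 1
            score += math.log(count / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (label, text)
docs = [
    ("spam", "win money now"),
    ("spam", "win a prize now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon tomorrow"),
]
model = train(docs)
```

In a MapReduce setting, the loop in `train` would become a map step emitting `(label, token)` pairs and a reduce step summing them. Prediction only needs the resulting count tables.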
  
The Ominous Challenges

1. The Secret Sauce Effect

2. Delta Between Mahout and the Cutting Edge

ML and Hadoop: Where Things are Headed

Moving Beyond MapReduce

The Contenders
  
AllReduce

•  Developed at Yahoo! Research
•  Defines the allreduce operation
    •  N machines each have a number => each machine has the sum of the numbers
•  At the heart of Vowpal Wabbit’s performance
•  Implemented in C++
•  Can be patched into Apache Hadoop and used today
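The allreduce operation described above can be sketched in a single process. This simulation assumes the machines are arranged as a binary tree, with partial sums flowing up to a root and the total flowing back down. The tree layout and the function name are assumptions for illustration, not Vowpal Wabbit's actual implementation.

```python
def allreduce_sum(values):
    """Simulate allreduce over len(values) machines arranged as a binary tree.

    Machine i has children 2*i + 1 and 2*i + 2. Partial sums flow up to
    machine 0 (reduce), then the total flows back down (broadcast).
    """
    n = len(values)
    partial = list(values)
    # Reduce phase: visit machines from the leaves toward the root, so each
    # machine has absorbed its children before passing its sum to its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]
    # Broadcast phase: each machine copies the total from its parent.
    out = [0] * n
    if n:
        out[0] = partial[0]
    for i in range(1, n):
        out[i] = out[(i - 1) // 2]
    return out

# Hypothetical per-machine values, e.g. one component of a local gradient.
local_values = [1.5, -0.5, 2.0, 4.0, 3.0]
result = allreduce_sum(local_values)
```

In distributed model fitting the numbers would typically be locally computed statistics such as gradient components, so that after the call every machine holds the same global sum and can take the same update step.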
  
Spark

•  Developed at Berkeley’s AMP Lab
•  Defines operations on distributed in-memory collections
•  Written in Scala
•  Supports reading from and writing to HDFS
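To give a feel for what operations on an in-memory collection buy an iterative algorithm, here is a toy single-process sketch. The `InMemoryCollection` class is a hypothetical stand-in written for this example. It is not Spark's API (Spark itself is written in Scala), and nothing here is distributed.

```python
from functools import reduce as fold

class InMemoryCollection:
    """Toy stand-in for a partitioned collection held in memory (not Spark's API)."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def map(self, fn):
        # Apply fn independently within each partition.
        return InMemoryCollection([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return InMemoryCollection([[x for x in p if pred(x)] for p in self.partitions])

    def reduce(self, fn):
        # Combine within each non-empty partition, then across partitions.
        partials = [fold(fn, p) for p in self.partitions if p]
        return fold(fn, partials)

# Data is loaded once and reused by every iteration without re-reading disk.
points = InMemoryCollection([[1.0, 2.0], [3.0, 4.0], [5.0]])
n = points.map(lambda _: 1).reduce(lambda a, b: a + b)

# Gradient descent on the squared error, which converges to the mean of the data.
estimate = 0.0
for _ in range(50):
    gradient = points.map(lambda x: estimate - x).reduce(lambda a, b: a + b) / n
    estimate -= 0.5 * gradient
```

The loop makes 50 passes over the same data. With a disk-based MapReduce job each pass would typically re-read its input, which is the cost an in-memory collection is meant to avoid.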
  
GraphLab

•  Developed at CMU
•  Lower-level primitives
    •  (but higher than MPI)
•  Map/Reduce => Update/Sort
•  Flexible, allows for asynchronous computations
•  Reads from HDFS
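The idea of per-vertex update functions run without a global iteration barrier can be sketched with a work queue. The example below is a single-threaded simulation using a PageRank-style update. The function name, the queue-based scheduler and the toy graph are assumptions for illustration and do not use GraphLab's API. It assumes every link target also appears as a key in `out_links`.

```python
from collections import deque

def async_pagerank(out_links, damping=0.85, tol=1e-10):
    """Run vertex updates from a work queue until no rank changes by more than tol."""
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)
    queued = set(vertices)

    while queue:
        v = queue.popleft()
        queued.discard(v)
        # Update function: recompute v from the current ranks of its in-neighbors.
        incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
        new_rank = (1.0 - damping) + damping * incoming
        changed = abs(new_rank - rank[v]) > tol
        rank[v] = new_rank
        if changed:
            # Reschedule only the vertices that read v's value.
            for w in out_links[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank

# Hypothetical three-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = async_pagerank(graph)
```

Unlike a MapReduce formulation, which recomputes every vertex on every pass, each update here reads the latest neighbor values and work stops as soon as a vertex's rank has settled.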
  
How Things Measure Up

Speed vs. Reliability

Memory vs. Disk

C++ vs. JVM

Questions?
(Ask Anything. Anything At All.)
jwills@cloudera.com
  
