Building Analytical Applications on Hadoop
    Josh Wills | Director of Data Science
    November 2012




1
About Me




2
What are ‘Analytical Applications?’




3
The Humble Dashboard




4
Crossfilter with Flight Information




5
New York Times Electoral Vote Map




6
New York Times Electoral Vote Map (Detail)




7
Analytical Applications vs. Frameworks




8
Developing Analytical Applications
    A Case Study




9
2012: The Predicting of the President




10
RealClearPolitics

     • Simple Average of Polls


     • Transparent


     • Simple Interactions



11
FiveThirtyEight

                       • “Foxy” Model


                       • Opaque


                       • Simple Interactions with
                        a richer UI

12
Princeton Election Consortium

     • Medians and
      Polynomials

     • Transparent


     • Rich Interactions
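One plausible reading of "medians and polynomials" is: take the median poll margin in each state, turn it into a win probability, and multiply one small polynomial per state to get the full distribution of electoral votes. The sketch below follows that reading under stated assumptions: the states, margins, and vote counts are made up, and the margin-to-probability mapping is a toy stand-in rather than the consortium's actual formula.

```python
from statistics import median

# Hypothetical inputs, invented for illustration (not real 2012 data).
state_polls = {
    "State X": [1.0, 3.0, -2.0],
    "State Y": [-4.0, -1.0, -3.0],
    "State Z": [0.5, 2.0, 6.0],
}
electoral_votes = {"State X": 10, "State Y": 6, "State Z": 4}


def win_probability(margin):
    """Toy mapping from a median margin to a win probability (an assumption)."""
    return min(1.0, max(0.0, 0.5 + margin / 10.0))


def ev_distribution(probabilities, votes):
    """Multiply the per-state polynomials (1 - p) + p * x**ev.

    The coefficient of x**k in the product is the probability of winning
    exactly k electoral votes, assuming the states are independent.
    """
    dist = [1.0]  # the polynomial "1": zero votes with probability 1
    for state, p in probabilities.items():
        ev = votes[state]
        new = [0.0] * (len(dist) + ev)
        for k, coeff in enumerate(dist):
            new[k] += coeff * (1.0 - p)  # lose this state
            new[k + ev] += coeff * p     # win this state
        dist = new
    return dist


medians = {state: median(m) for state, m in state_polls.items()}
probabilities = {state: win_probability(m) for state, m in medians.items()}
distribution = ev_distribution(probabilities, electoral_votes)

needed = sum(electoral_votes.values()) // 2 + 1
p_majority = sum(distribution[needed:])
print(f"P(at least {needed} votes) = {p_majority:.3f}")
```

Using the median rather than the mean keeps a single outlier poll from moving a state's estimate very far.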


13
How Did They Do?




14
A Few of These, Because They’re Fun




15
A Few of These, Because They’re Fun




16
A Few of These, Because They’re Fun




17
Here’s the Rub: One Expert Beat Nate




18
Index Funds, Hedge Funds, and Warren Buffett




19
A Brief Introduction to Hadoop




20
Data Storage in 2001: Databases

     • Structured schemas
     • Intensive processing
       done where data is
       stored
     • Somewhat reliable
     • Expensive at scale


21
Data Storage in 2001: Filers

                               • No schemas, stores any
                                 kind of file
                               • No data processing
                                 capability
                               • Reliable
                               • Expensive at scale


22
And Then, This Happened




23
Data Economics: Return on Byte




24
Big Data Economics

     • No individual record is
       particularly valuable
     • Having every record is
       incredibly valuable
         •   Web index
         •   Recommendation systems
         •   Sensor data
         •   Market basket analysis
         •   Online advertising

25
Introduction to Hadoop




26
The Hadoop Distributed File System

     • Based on the Google File
       System
     • Data stored in large files
        • Large block size: 64MB to
          256MB per block
        • Blocks are replicated to
          multiple nodes in the
          cluster
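To make the block arithmetic concrete, here is a small sketch that estimates how many blocks a file occupies and how much raw disk it consumes. The 128MB block size is one value from the range above, and the replication factor of 3 is an assumption (the slide does not state one); the function name is hypothetical.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Assumes the last block is not padded, so raw storage is simply the
    file size times the replication factor.
    """
    blocks = (file_bytes + block_bytes - 1) // block_bytes  # ceiling division
    replicas = blocks * replication
    raw_bytes = file_bytes * replication
    return blocks, replicas, raw_bytes


one_gib = 1024**3
blocks, replicas, raw_bytes = hdfs_footprint(one_gib)
print(f"{blocks} blocks, {replicas} replicas, {raw_bytes / one_gib:.0f} GiB raw")
```

Large blocks keep the number of blocks per file small, which is part of why the design favors a few large files over many tiny ones.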

27
Simple, Reliable Processing: MapReduce
     •   Map Stage
          •   Embarrassingly parallel
     • Shuffle Stage: Large-scale distributed sort
     • Reduce Stage
          •   Process all of the values that have the same key in a single step
     • Process the data where it is stored
     • Write once and you’re done.
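The three stages can be imitated in a single process to show the data flow. This is a toy word count in plain Python, not Hadoop's API: the function names are made up for the example, and a real job would run the map and reduce functions on many machines.

```python
from collections import defaultdict


def map_stage(record):
    """Map: emit (key, value) pairs for one record, independent of all others."""
    for word in record.lower().split():
        yield word, 1


def shuffle_stage(pairs):
    """Shuffle: group values by key and sort by key, like the distributed sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(key, values):
    """Reduce: process all of the values for one key in a single step."""
    return key, sum(values)


def run_job(records):
    mapped = (pair for record in records for pair in map_stage(record))
    return [reduce_stage(key, values) for key, values in shuffle_stage(mapped)]


records = ["hadoop stores data", "hadoop processes data"]
counts = run_job(records)
print(counts)
```

Because each call to `map_stage` looks at one record only, the map work can be split across as many machines as there are chunks of input.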



28
Developing Analytical Applications
     with Hadoop




29
Novelty is the Enemy of Adoption




30
The Best Way to Get Started: Apache Hive

     •   Apache Hive
          •   Data Warehouse System on
              top of Hadoop
     •   SQL-based query language
          • SELECT, INSERT, CREATE
            TABLE
          • Includes some MapReduce-
            specific extensions
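The appeal of a SQL-based layer is that a grouped count becomes a one-line query instead of a hand-written job. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since no Hive installation is assumed; the table and column names are hypothetical, and Hive's own dialect differs in details such as how rows are loaded.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")

# CREATE TABLE: declare the shape of the data.
conn.execute("CREATE TABLE page_views (url TEXT, user_id INTEGER)")

# INSERT: load a few made-up rows.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 1), ("/home", 2), ("/about", 1)],
)

# SELECT: the GROUP BY plays the role of the shuffle and reduce stages.
rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC, url"
).fetchall()
conn.close()

for url, views in rows:
    print(url, views)
```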

31
Borrowing Abstractions




32
Improving the UX (http://github.com/cloudera/impala)




33
Moving Beyond the Abstractions




34
Making the Abstract Concrete




35
Cloudera’s Data Science Course




36
Analytical Applications I Love




37
The Experiments Dashboard




38
Adverse Drug Events




39
Gene Sequencing and Analytics




40
The Doctor’s Perspective




41
A Couple of Themes
     1.   Structure the data in the way that makes sense for the
          problem.

     2.   Interactive inputs, not just interactive outputs.

     3.   Simpler interfaces that yield more sophisticated answers.



42
Working Towards The Dream




43
Developing Analytical Applications
     Moving Beyond MapReduce




44
The Cambrian Explosion…of Frameworks 




45
It’s Frameworks All The Way Down: Spark

     • Developed at Berkeley’s
       AMP Lab
     • Defines operations on
       distributed in-memory
       collections
     • Written in Scala
     • Supports reading from and
       writing to HDFS

46
IFATWD: GraphLab

     • Developed at CMU
     • Lower-level primitives
         •   (but higher than MPI)
     • Map/Reduce =>
       Update/Sort
     • Flexible, allows for
       asynchronous
       computations
     • Reads from HDFS

47
Playing with YARN




48
BranchReduce (http://github.com/cloudera/branchreduce)




49
50

Editor's Notes

  • #4: They are applications that allow users to work with and make decisions from data.
  • #5: It seems like there should be a UX equivalent of Clippy (maybe a tiny picture of Edward Tufte) that pops up whenever someone decides to use a 3D pie chart.
  • #6: http://square.github.com/crossfilter/
  • #7: http://elections.nytimes.com/2012/results/president (Click on “Shift from 2008”)
  • #8: Click on a state to zoom in
  • #9: Frameworks != Analytical applications, for our purposes today. It’s not an analytical application until you put some data in it.
  • #11: A few different models were developed for predicting the presidency in 2012. Let’s consider a few of them.
  • #12: http://www.realclearpolitics.com/epolls/2012/president/2012_elections_electoral_college_map.html
  • #13: http://fivethirtyeight.blogs.nytimes.com/
  • #14: http://election.princeton.edu/
  • #15: http://isnatesilverawitch.com/ Everyone predicted the election correctly. The RCP model got every state but Florida, PEC called Florida a tossup, and 538 got every single state right.
  • #19: Markos Moulitsas over at the Daily Kos did even better than Nate at predicting the share of the vote within the swing states. Don’t think that math can always out-perform an expert armed with good data. http://news.cnet.com/8301-13578_3-57546778-38/among-the-top-election-quants-nate-silver-reigns-supreme/
  • #20: Index fund == simple average. Hedge fund == 538. Warren Buffett == expert with good data.
  • #25: Classical data economics: If the value I can extract from a byte is less than the cost to store it, then I throw it away or store it on tape.
  • #31: We use metaphors that help us understand new technology in terms of the old. Translate desktop tools and metaphors onto Hadoop, even when we’re working with specialized data types: http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
  • #32: It’s a data warehousing metaphor, not an actual data warehouse. Schema on read vs. schema on write, for example. Non-interactive for the most part. Think of ELT, not interactive queries.
  • #33: We borrow these abstractions because they make it easy to get started, but they don’t necessarily conform to the user’s expectations of how Hadoop will work. If you think of Hadoop as a really big database, or as a spreadsheet that goes on forever and ever, then you have failed to understand Hadoop.
  • #34: Impala is about fulfilling those abstractions, esp. for interactive queries of relational-style data on Hadoop.
  • #35: But we can also go beyond the abstractions and study how Hadoop can be effective for new kinds of analytic applications.
  • #36: Step 1: Study real problems. Especially real problems where non-sophisticated users (e.g., people who don’t even know SQL) need to do sophisticated analysis on large quantities of information.
  • #37: I realized earlier this year that other people do not use Hive the way that I use Hive, and so we created the data science course to take people through the problem of building an analytical application from start to finish on Hadoop. http://blog.cloudera.com/blog/2012/10/data-science-training/
  • #38: They are applications that allow users to work with and make decisions from data.
  • #40: http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
  • #41: http://www.slideshare.net/cloudera/7-leveraging-h-base-for-the-worlds-largest-curated-genomic-data-collection-satnam-alag-nextbio-finalupdatedlastminute
  • #42: The truth is that building tools for unsophisticated users typically requires incredibly sophisticated development.
  • #44: An open-source version of Wolfram Alpha for useful data.
  • #49: https://github.com/cloudera/kitten
  • #50: http://github.com/cloudera/branchreduce