Building analytical apps on Hadoop

Building Analytical Applications on Hadoop
Josh Wills | Director of Data Science
November 2012

What are ‘Analytical Applications?’

Crossfilter with Flight Information

New York Times Electoral Vote Map

New York Times Electoral Vote Map (Detail)

Analytical Applications vs. Frameworks

Developing Analytical Applications
A Case Study

2012: The Predicting of the President

RealClearPolitics

• Simple Average of Polls

• Transparent

• Simple Interactions

FiveThirtyEight

• “Foxy” Model

• Opaque

• Simple Interactions with a richer UI

Princeton Election Consortium

• Medians and Polynomials

• Transparent

• Rich Interactions
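
The difference between averaging polls and taking their median can be sketched in a few lines of Python. This is only an illustration of the two summary statistics named on these slides; the margins below are invented for the example and are not real polling data, and neither site's actual model is reproduced here.

```python
from statistics import mean, median

# Hypothetical poll margins (candidate A minus candidate B, in points).
# These numbers are made up for illustration; they are not real polls.
margins = [2.0, 3.0, 1.0, 4.0, -12.0]

simple_average = mean(margins)  # pulled toward the single outlying poll
poll_median = median(margins)   # middle value, less sensitive to the outlier

print(f"average: {simple_average:.1f}")
print(f"median:  {poll_median:.1f}")
```

With one outlying poll in the list, the average moves noticeably while the median stays with the bulk of the polls, which is one reason an aggregator might prefer medians.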

A Few of These, Because They’re Fun

Here’s the Rub: One Expert Beat Nate

Index Funds, Hedge Funds, and Warren Buffett

A Brief Introduction to Hadoop

Data Storage in 2001: Databases

• Structured schemas
• Intensive processing done where data is stored
• Somewhat reliable
• Expensive at scale

Data Storage in 2001: Filers

• No schemas, stores any kind of file
• No data processing capability
• Reliable
• Expensive at scale

And Then, This Happened

Data Economics: Return on Byte

Big Data Economics

• No individual record is particularly valuable
• Having every record is incredibly valuable
  • Web index
  • Recommendation systems
  • Sensor data
  • Market basket analysis
  • Online advertising

The Hadoop Distributed File System

• Based on the Google File System
• Data stored in large files
• Large block size: 64 MB to 256 MB per block
• Blocks are replicated to multiple nodes in the cluster
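
The arithmetic behind large blocks and replication can be sketched as follows. The 128 MB block size is one value inside the range on the slide, and the replication factor of 3 is an assumption (the slide only says "multiple nodes"); `block_layout` is a hypothetical helper written for this example, not part of any Hadoop API.

```python
MB = 1024 * 1024

def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    block_size_bytes and replication are assumed values for illustration.
    """
    # Round up: a partially filled final block still counts as a block.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Every byte of the file is stored once per replica.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes

blocks, raw = block_layout(1024 * MB)  # a 1 GB file
print(blocks, "blocks,", raw // MB, "MB of raw storage")
```

Under these assumptions a 1 GB file becomes 8 blocks and occupies 3 GB of raw disk, which is the price paid for surviving the loss of individual nodes.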

Simple, Reliable Processing: MapReduce
• Map Stage
  • Embarrassingly parallel
• Shuffle Stage: Large-scale distributed sort
• Reduce Stage
  • Process all of the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done.
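
The three stages can be sketched on a single machine with a word count. This is a toy, in-memory illustration of the programming model only: it does not use Hadoop, and the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) are hypothetical names chosen for this example.

```python
from collections import defaultdict

def map_stage(record):
    """Emit (key, value) pairs for one record; each record is handled independently."""
    for word in record.split():
        yield word.lower(), 1

def shuffle_stage(pairs):
    """Group values by key and sort the keys, standing in for the distributed sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_stage(key, values):
    """Process all of the values that share a key in a single step."""
    return key, sum(values)

def run_job(records):
    mapped = (pair for record in records for pair in map_stage(record))
    return [reduce_stage(key, values) for key, values in shuffle_stage(mapped)]

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because `map_stage` looks at one record at a time, the map calls could run on different machines without coordination; only the shuffle needs to move data between them.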

with Hadoop

Novelty is the Enemy of Adoption

The Best Way to Get Started: Apache Hive

• Apache Hive
• Data Warehouse System on top of Hadoop
• SQL-based query language
• SELECT, INSERT, CREATE TABLE
• Includes some MapReduce-specific extensions
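
For readers who have not used SQL, the three statement types named above look roughly like this. The sketch runs against SQLite from Python's standard library purely so that it executes locally; it is not Hive, the `page_views` table and its rows are hypothetical, and Hive's own dialect differs in details such as data types and how INSERT is used.

```python
import sqlite3

# SQLite stands in for Hive here only so the example is runnable as-is.
conn = sqlite3.connect(":memory:")

# CREATE TABLE: declare a schema for the data (hypothetical table).
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")

# INSERT: load a few made-up rows.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u2", "/home"), ("u1", "/search")],
)

# SELECT: aggregate views per URL.
rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)
conn.close()
```

The appeal for adoption is that an analyst who already writes queries of this shape can start working with data in Hadoop without first learning to write MapReduce jobs.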

Improving the UX (http://github.com/cloudera/impala)

Moving Beyond the Abstractions

Making the Abstract Concrete

Cloudera’s Data Science Course

Analytical Applications I Love

The Experiments Dashboard

Gene Sequencing and Analytics

The Doctor’s Perspective

A Couple of Themes
1. Structure the data in the way that makes sense for the problem.

2. Interactive inputs, not just interactive outputs.

3. Simpler interfaces that yield more sophisticated answers.

Working Towards The Dream

Moving Beyond MapReduce

The Cambrian Explosion…of Frameworks 

It’s Frameworks All The Way Down: Spark

• Developed at Berkeley’s AMPLab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS

IFATWD: GraphLab

• Developed at CMU
• Lower-level primitives (but higher than MPI)
• Map/Reduce => Update/Sync
• Flexible, allows for asynchronous computations
• Reads from HDFS

BranchReduce (http://github.com/cloudera/branchreduce)
