Why Hadoop for data science?

Ofer Mendelevitch
PASS BA Conference, April 2013

© Hortonworks Inc. 2013

A brief history of Apache Hadoop

Timeline, 2004 to 2013: Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop (2013)

• 2005: Yahoo! creates team under E14 to work on Hadoop (focus on INNOVATION)
• 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters (focus on OPERATIONS)
• 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo! (focus on STABILITY)


Core Hadoop: HDFS & Map Reduce

Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store

• Map-reduce: distributed computation framework that
handles the complexities of distributed programming
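
To make the map-reduce model concrete, here is a minimal single-process sketch of a word count in Python. It only imitates the three stages (map, shuffle, reduce) that the framework would run across a cluster; the function names are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster each mapper would see only the lines stored on its own node, and the shuffle would move the grouped pairs over the network to the reducers.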


Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed and co-developed to work together

• Process data in parallel across thousands of
“commodity” hardware nodes
– Self-healing; failure handled by software

• Designed for one write and multiple reads
– There are no random writes
– Optimized for minimum seek on hard drives

HDP: Enterprise-Ready Hadoop
HORTONWORKS DATA PLATFORM (HDP)

• Operational services (manage & operate at scale): Ambari, Oozie
• Data services (store, process and access data): Flume, Sqoop, Hive, Pig, HBase, HCatalog
• Hadoop core (distributed storage & processing): HDFS, MapReduce
• Platform services (enterprise readiness): HA, DR, snapshots, security, …
• Runs on: OS / VM, cloud, appliance


What is a data product?

“A software system whose core
functionality depends on the
application of statistical analysis
and machine learning to data.”


Example 1: Google AdWords


Example 2: People you may know


Example 3: Spell correction


What is data science?

#1: Extracting deep meaning from data
(data mining; finding “gems” in data)


Common data science tasks

Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns

Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference


What is data science?

#2: Building data products
(Delivering gems on a regular basis)
[Diagram: periodic batch processing (pre-process, then build model) feeds online serving (SQL)]



Reason #1:
Explore full datasets


Explore large datasets directly with Hadoop

[Diagram: analysis cycle on the researcher's laptop (R, Matlab, SAS, etc.): Acquire → Clean data → Visualize, grok → Model → Measure/evaluate, with the full dataset stored on Hadoop]


Integrate Hadoop in your data analysis flow

• Exploratory data analysis on full dataset
– Simple statistics: mean, median, quantile, etc.
– Pre-processing: grep, regex, etc.

• Ad-hoc sampling / filtering
– Random: with or without replacement
– Sample by unique key
– K-fold cross-validation
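
The sampling ideas above can be sketched in a few lines of Python. This is a minimal illustration, assuming records are held in a local list; on a cluster the same per-record logic would run inside a map task. The helper names and the choice of an MD5 hash for bucketing are assumptions made for this example.

```python
import hashlib
import random


def _bucket(key, buckets):
    """Map a key to a stable bucket in [0, buckets) using an MD5 hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def sample_by_key(records, key_fn, percent):
    """Keep every record whose key hashes into the first `percent` of 100 buckets."""
    return [r for r in records if _bucket(key_fn(r), 100) < percent]


def random_sample(records, fraction, seed=0):
    """Sample without replacement: keep each record with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def fold_of(key, k):
    """Assign a record to one of k cross-validation folds by its key."""
    return _bucket(key, k)
```

Because the bucket depends only on the key, sampling by unique key keeps either all of a user's records or none of them, and every node assigns the same record to the same fold without coordination.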



Reason #2:
Mine larger datasets


More data -> better outcomes

Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009


Learning algorithms with large datasets…

Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…

Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
– Recommendation: Alternating Least Squares (ALS)
– Clustering: K-means
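
As an illustration of a distributed/parallel algorithm, here is a minimal sketch of K-means written in a map/reduce style. It assumes the data is already split into partitions (plain Python lists standing in for blocks on different nodes); the function names are hypothetical and this is not the implementation of any particular library.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def dist2(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def kmeans_map(partition, centroids):
    """Mapper: for one partition, build per-centroid partial sums and counts."""
    partial = {}
    for point in partition:
        i = nearest(point, centroids)
        sums, count = partial.get(i, ([0.0] * len(point), 0))
        partial[i] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def kmeans_reduce(partials, centroids):
    """Reducer: merge partial sums from all partitions into new centroids."""
    totals = {}
    for partial in partials:
        for i, (sums, count) in partial.items():
            tsums, tcount = totals.get(i, ([0.0] * len(sums), 0))
            totals[i] = ([a + b for a, b in zip(tsums, sums)], tcount + count)
    # A centroid with no assigned points keeps its previous position.
    return [
        tuple(s / totals[i][1] for s in totals[i][0]) if i in totals else tuple(c)
        for i, c in enumerate(centroids)
    ]


def kmeans(partitions, centroids, iterations=10):
    """Repeat the map and reduce steps for a fixed number of iterations."""
    for _ in range(iterations):
        partials = [kmeans_map(p, centroids) for p in partitions]
        centroids = kmeans_reduce(partials, centroids)
    return centroids
```

Each mapper sends only one sum and one count per centroid, so the data moved between nodes stays small no matter how many points a partition holds.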



Reason #3:
Large-scale data preparation


80% of data science work is data preparation

Raw data → processed data, through steps such as:
• Sampling, filtering
• Joins
• Entity resolution
• Strip away HTML/PDF/DOC/PPT
• Document vector generation
• Term normalization

Hadoop is ideal for batch data preparation and
cleanup of large datasets
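
A few of these preparation steps can be sketched for a single document in Python. This is a minimal illustration using only the standard library: the regex-based tag stripping is a simplification that would not handle every real HTML page, and the function names are made up for this example.

```python
import re
from collections import Counter
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def strip_html(text):
    """Remove tags with a simple regex and decode HTML entities."""
    return unescape(TAG_RE.sub(" ", text))


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def document_vector(text):
    """Sparse term-frequency vector for one raw document."""
    return Counter(normalize_terms(strip_html(text)))
```

Each document is processed independently of the others, which is why this kind of cleanup fits a batch job where every mapper handles its own share of the documents.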



Reason #4:
Accelerate data-driven innovation


Barriers to speed with traditional data architectures

• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation

Timeline:
• Start: “I need new data” (schema change project begins)
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”


“Schema on read” means faster time-to-innovation

• Hadoop uses “schema on read”
• Low barrier for data-driven innovation

Timeline:
• Start: “I need new data” (“Let’s just put it in a folder on HDFS”)
• 3 months: “Let me see… is it any good?”
• 6 months: “My model is awesome!”
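
The idea of “schema on read” can be illustrated with a small Python sketch: raw records are stored untouched, and each reader applies its own schema when it loads them. The JSON-lines format and the field names (user, clicks, referrer) are assumptions for this example, not something the platform prescribes.

```python
import json


def read_with_schema(raw_lines, schema):
    """Apply `schema` (field name -> type converter) while reading raw JSON lines."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Raw events are stored as-is; a newly added field needs no migration.
raw = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": "5", "referrer": "search"}',
]

# Two readers project different schemas onto the same stored data.
clicks_view = list(read_with_schema(raw, {"user": str, "clicks": int}))
referrer_view = list(read_with_schema(raw, {"user": str, "referrer": str}))
```

The second record carries a new field, yet nothing about the stored data or the first reader had to change; only the reader that wants the new field names it in its schema.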


Summary

Why use Hadoop for data science?
1. Data exploration with full datasets
2. Mine larger datasets
3. Pre-processing at scale
4. Faster data-driven cycles


Quick start: Hortonworks Sandbox
• What it is
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updatable hands-on, step-by-step tutorials
• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your data to explore and investigate your use cases
• ZERO to big data in 15 minutes

Download Hortonworks Sandbox
www.hortonworks.com/sandbox
Sign up for Training for in-depth learning
hortonworks.com/hadoop-training/


Thank you!

Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend, @hortonworks

Come visit us @ Booth S5
We’re hiring!

