Data Science Crash Course Hadoop Summit SJ

Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Hands-on Intro to Data Science
with Apache Spark
Crash Course

© Hortonworks Inc. 2011–2016. All Rights Reserved
Plan for Today
• Data Science & ML
• ML Examples
• Overview of ML methods
• K-means, Decision Trees & Random Forests
• Spark MLlib & ML
• Lab Overview

Data Science Examples

Predictive Analytics Pre-requisites
Sales Play 4: Predictive Analytics

Predictive Analytics Process and Tools

Machine Learning
“… science of how computers learn without being explicitly programmed” – Andrew Ng

Machine Learning Methods

Supervised vs Unsupervised Learning
Supervised Learning: examples are labeled.
Unsupervised Learning: examples are not labeled.

CLASSIFICATION
Identifying the category to which an object belongs.
Applications: spam detection, image recognition, ...
Algorithms: k-nn, decision trees, random forest, ...

REGRESSION
Predicting a continuous-valued attribute
associated with an object.
Applications: drug response, stock prices, …
Algorithms: linear regression, …

CLUSTERING
Automatic grouping of similar objects into sets.
Applications: customer segmentation, topic modeling, …
Algorithms: k-means, LDA, …

COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix.
Applications: Product recommendation, …
Algorithms: Alternating Least Squares (ALS)

DIMENSIONALITY REDUCTION
Reducing the number of random variables to consider.
Applications: visualization, increased efficiency, …
Algorithms: PCA, t-SNE, …

PREPROCESSING
Feature extraction and normalization
Applications: transforming input data, such as text, into a form ML algorithms can use
Algorithms: TF-IDF, word2vec, one hot encoding, …
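
As an illustration of one preprocessing step named above, here is a minimal pure-Python sketch of one hot encoding. The function name `one_hot_encode` and the color values are assumptions made for this example; in Spark you would normally use a library encoder rather than hand-written code.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors


# Hypothetical categorical column.
categories, vectors = one_hot_encode(["red", "green", "red", "blue"])
print(categories)
print(vectors)
```

Each distinct category gets its own column, which lets algorithms that expect numeric input consume categorical data without implying an ordering between categories.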

MODEL SELECTION
Comparing, validating and choosing parameters and models.
Applications: improved accuracy via parameter tuning
Algorithms: grid search, metrics …
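
Grid search can be sketched in a few lines: try every combination of candidate parameter values and keep the best-scoring one. This is only a sketch under stated assumptions: `grid_search` and `toy_validation_score` are hypothetical names, and the scoring function is a stand-in for "train a model and measure validation accuracy".

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Evaluate score_fn on every parameter combination; return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(max_depth, num_trees):
    # Hypothetical stand-in for a real train-and-validate step.
    return 1.0 - abs(max_depth - 5) * 0.05 - abs(num_trees - 50) * 0.001


best, score = grid_search(
    toy_validation_score,
    {"max_depth": [3, 5, 7], "num_trees": [10, 50, 100]},
)
print(best, score)
```

The cost grows with the product of the grid sizes, which is why grids are usually kept small or replaced by random sampling when there are many parameters.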

Spark MLlib

Spark Machine Learning Library
• Clustering
– k-means clustering
– latent Dirichlet allocation (LDA)
• Dimensionality reduction
– singular value decomposition (SVD)
– principal component analysis (PCA)
• Feature Extractors & Transformers
– word2vec
• Basic statistics
– summary statistics
– hypothesis testing
– random number generation
• Classification and regression
– linear models (SVMs, logistic & linear regression)
– decision trees
– ensembles of trees (Random Forests & GBTs)
• Collaborative filtering
– alternating least squares (ALS)

K-Means Clustering
(Unsupervised Learning)

Why K-Means
• Simple & fast algorithm to find clusters
• Common technique for anomaly detection
• Drawbacks
– Doesn't work well with non-spherical cluster shapes
– Number of clusters and initial seed values need to be specified beforehand
– Strong sensitivity to outliers and noise
– Prone to getting stuck in a local optimum

Initialize Cluster Centers
Randomly pick 3 cluster centers.

Assign Each Point
Assign each point to the nearest cluster center.

Recompute Cluster Centers
Move each cluster center to the mean of the points assigned to it.

K-means Clustering

San Francisco

Outline Each Neighborhood

Folium: choropleth map

SF Neighborhood Centers Calculated with K-Means

Sample Dataset – K-Means
0.0, 0.0, 0.0
0.1, 0.1, 0.1
0.2, 0.2, 0.2
3.0, 3.0, 3.0
3.1, 3.1, 3.1
3.2, 3.2, 3.2
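
The three k-means steps (initialize centers, assign each point, recompute centers) can be sketched in plain Python on the six sample points above. This is a minimal illustration, not Spark's implementation: the function names, the fixed seed, the iteration cap, and the choice of k=2 for this dataset are all assumptions.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly pick k of the points as initial cluster centers.
    centers = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Step 2: assign each point to the nearest cluster center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return centers, assignments


points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (3.0, 3.0, 3.0), (3.1, 3.1, 3.1), (3.2, 3.2, 3.2),
]
centers, assignments = kmeans(points, k=2)
print(sorted(centers))
print(assignments)
```

On this dataset the two groups are far apart, so the first three points end up in one cluster and the last three in the other, with centers near (0.1, 0.1, 0.1) and (3.1, 3.1, 3.1). On harder data the result depends on the initial centers, which is the local-optimum drawback of k-means.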

Decision Trees & Random Forests
(Supervised Learning)

Why Decision Trees?
• Simple to understand and interpret. (And explain to executives.)
• Require little data preparation. (Other techniques often require data
normalisation, creation of dummy variables, and removal of blank values.)
• Perform well with large datasets.

Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1

Random Forest (Ensemble Model)
• Main idea: build an ensemble of simple decision trees
• Each tree is simple and less likely to overfit
• Classify/predict by voting between all trees
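
The voting idea can be sketched with a handful of one-split "trees" (decision stumps). Everything here is illustrative: `stump`, `forest_predict`, and the hard-coded thresholds are assumptions, and a real random forest would learn each tree from a random sample of the training data instead.

```python
from collections import Counter


def stump(feature_index, threshold):
    """Return a one-split tree: predicts 1 if the feature exceeds the threshold."""
    def predict(x):
        return 1 if x[feature_index] > threshold else 0
    return predict


# Hypothetical ensemble with hard-coded splits (not learned from data).
forest = [stump(0, 0.5), stump(1, 0.3), stump(0, 0.7)]


def forest_predict(trees, x):
    """Classify x by majority vote across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


print(forest_predict(forest, (0.6, 0.4)))  # two of three trees vote 1
print(forest_predict(forest, (0.2, 0.1)))  # all trees vote 0
```

Using an odd number of trees avoids ties in a two-class problem; individual trees can be wrong as long as the majority is right.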

Decision Tree vs Random Forest

Why Do Ensembles Work?
• Overcome limitations of a single hypothesis
[Figure panels: Decision Tree; Model Averaging]

Diabetes Dataset – Decision Trees / Random Forest
Labeled set with 8 Features
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
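
The rows above appear to follow the LIBSVM sparse text format: a label followed by `index:value` pairs. As a minimal sketch under that assumption, one row can be parsed by hand like this (the function name `parse_libsvm_line` is hypothetical; Spark provides its own loader for this format).

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' row into a label and a feature dict."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# First row of the sample shown above.
sample = ("-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 "
          "5:-1 6:0.00149028 7:-0.53117 8:-0.0333333")
label, features = parse_libsvm_line(sample)
print(label)
print(features)
```

The label (-1 or +1) is the class to predict, and the eight indexed values are the features, here already scaled to the range [-1, 1].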

Machine Learning in Spark

Spark Ecosystem
Spark Core
Libraries on top of Spark Core: Spark SQL, Spark Streaming, MLlib, GraphX

Machine Learning with Spark (MLlib & ML)
MLlib
• Original “lower-level” API
• Built on top of RDDs
• Maintenance mode starting with Spark 2.0
ML
• Newer “higher-level” API for constructing workflows
• Built on top of DataFrames
Algorithms in both APIs are implemented to take advantage of data parallelism

Supervised Learning: End-to-End Flow
Training (batch): Data items + Labels → Feature Extraction → Feature Matrix (training set) → Train the Model → Model
Predicting (real time or batch): Data item → Feature Extraction → Feature Vector → Model → Predict → Label

Spark ML: Spark API for building ML pipelines
[Diagram] Pipeline stages: Feature transform 1 → Feature transform 2 → Combine features → Random Forest
Input DataFrame (TRAIN) → Pipeline → Pipeline Model
Input DataFrame (TEST) → Pipeline Model → Output DataFrame (PREDICTIONS)

Spark ML Pipeline
• A Pipeline exposes fit(); the Pipeline Model it returns exposes transform()
– fit() is for training
– transform() is for prediction
[Diagram] Input DataFrame (TRAIN) → Pipeline → fit() → Pipeline Model
Input DataFrame (TEST) → Pipeline Model → transform() → Output DataFrame (PREDICTIONS)
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model
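
To make the fit()/transform() split concrete, here is a toy analogy in plain Python. It is not Spark code: the classes `MeanCenterer` and `MeanCentererModel` are hypothetical, and plain lists stand in for DataFrames. The point is only that fit() learns something from training data and returns a separate model object, and that model's transform() applies what was learned to new data.

```python
class MeanCentererModel:
    """Fitted model: remembers the training mean and applies it."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Apply what was learned during fit() to new data.
        return [v - self.mean for v in values]


class MeanCenterer:
    """Toy estimator: fit() learns from training data and returns a model."""

    def fit(self, values):
        return MeanCentererModel(sum(values) / len(values))


toy_model = MeanCenterer().fit([1.0, 2.0, 3.0])   # "training"
toy_results = toy_model.transform([2.0, 4.0])      # "prediction" on new data
print(toy_results)
```

Keeping the estimator and the fitted model as separate objects means the same untrained pipeline definition can be fitted to different training sets without the results interfering with each other.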

Spark ML – Simple Random Forest Example
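from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, StringIndexer, Tokenizer, VectorAssembler

# Assumes trainData and testData are DataFrames with "district", "text-field" and "label" columns
indexer = StringIndexer(inputCol="district", outputCol="dis-inx")
parser = Tokenizer(inputCol="text-field", outputCol="words")
hashingTF = HashingTF(numFeatures=50, inputCol="words", outputCol="hash-inx")
vecAssembler = VectorAssembler(
    inputCols=["dis-inx", "hash-inx"],
    outputCol="features")
rf = RandomForestClassifier(numTrees=100, labelCol="label", seed=42)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)  # Train model
results = model.transform(testData)  # Predict on test data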

Apache Zeppelin – A Modern Web-based Data Science Studio
• Data exploration and discovery
• Visualization
• Deeply integrated with Spark and Hadoop
• Pluggable interpreters
• Multiple languages in one notebook: R, Python, Scala

Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
• Supported models
– K-Means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary

Additional Resources
• Machine Learning
• Natural Language Processing (NLP)
• Scalable Machine Learning
• Introduction to Statistics

Lab Overview
tinyurl.com/hwx-intro-to-ml-with-spark

Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Robert Hryniewicz
@RobHryniewicz
Thanks!
