Intro to Apache Spark - Lab

www.mammothdata.com | @mammothdataco
Lab Overview
● ‘Hello world’ RDD example
● Importing a dataset
● DataFrame operations and visualizations
● Using MLlib on the dataset

Lab — Hello World
● ./run_spark

Lab — Hello World
● val text = sc.parallelize(Seq("your text here"))
● val words = text.flatMap(line => line.split(" "))
● words.collect

Lab — Hello World
● val taggedWords = words.map(word => (word,1))
● val counts = taggedWords.reduceByKey(_ + _)
● counts.collect()
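The same three steps (split, tag with 1, sum per key) can be tried without a Spark shell. The following is a minimal sketch using plain Scala collections; `LocalWordCount` is a hypothetical name, and `groupBy` plus `sum` stands in for `reduceByKey(_ + _)`, so it illustrates the logic rather than Spark's distributed execution.

```scala
object LocalWordCount {
  // Mirrors the RDD pipeline on a local Seq:
  // flatMap to words -> map to (word, 1) -> sum the 1s per word.
  def count(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(line => line.split(" "))
    val taggedWords = words.map(word => (word, 1))
    // groupBy + sum is the local stand-in for reduceByKey(_ + _)
    taggedWords
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }
}
```

For example, `LocalWordCount.count(Seq("to be or not to be"))` returns a map in which "to" and "be" each have a count of 2.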

Lab — Dataset
● https://archive.ics.uci.edu/ml/datasets/Wine
● Information on 3 different types of wine from Genoa
● 178 entries (small!)

Lab — Loading The Wine Dataset
● val wines = sqlContext.read.json("wine.json")
● wines.registerTempTable("wines")

Lab — Showing the generated Schema
● wines.printSchema

Lab — Dataframe Operations
● wines.first

● sqlContext.sql("SELECT Type, count(Type) AS count FROM wines GROUP BY Type").show

● Experiment with %sql on the dataset (SELECT, COUNT, etc)

Lab — K-means Clustering
● K-Means clustering is an unsupervised algorithm which splits a
dataset into a number of clusters (k) based on a notion of
similarity between points. It is often applied to real-world data
to obtain a picture of structure hidden in large datasets, for
example, identifying location clusters or breaking down sales
into distinct purchasing groups.

Step 1: k initial "means" (in this case k = 3) are randomly generated within the data domain.

Step 2: k clusters are created by assigning each data point to its closest mean.

Step 3: The centroid of each cluster is computed and becomes the new mean. New clusters are then formed by assigning each data point to the closest of these new means, as in Step 2. The process is repeated until the means converge (or until we hit our iteration limit).
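The three steps above can be sketched in plain Scala, without Spark. This is an illustrative toy, not how MLlib implements k-means: `LloydSketch` and its method names are hypothetical, and the initial means are passed in by the caller rather than generated randomly, so the result is reproducible.

```scala
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the mean closest to p (the assignment step).
  def closest(p: Point, means: Seq[Point]): Int =
    means.indices.minBy(i => sqDist(p, means(i)))

  // Component-wise average of a non-empty group of points.
  def centroid(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // Repeats assign/update until the means stop moving
  // or the iteration limit is reached.
  @annotation.tailrec
  def run(points: Seq[Point], means: Seq[Point], maxIterations: Int): Seq[Point] =
    if (maxIterations <= 0) means
    else {
      val clusters = points.groupBy(p => closest(p, means))
      // A mean that attracted no points keeps its old position.
      val updated = means.indices.map { i =>
        clusters.get(i).map(centroid).getOrElse(means(i))
      }
      if (updated == means) means else run(points, updated, maxIterations - 1)
    }
}
```

Called with the one-dimensional points 0, 1, 10, 11 and starting means 0 and 10, `run` settles on the means 0.5 and 10.5 after one update.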

Lab — K-means Clustering: Imports
● import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
● import org.apache.spark.sql._

Lab — K-means Clustering: Features
● val featureCols = wines.select("Alcohol", "Hue", "Proline")
● val features = featureCols.rdd.map { case Row(a: Double, h: Double, p: Double) => Vectors.dense(a, h, p) }
● features.cache

Lab — K-means Clustering: Training Model
● val numClusters = 2
● val numIterations = 20
● val model = KMeans.train(features, numClusters, numIterations)

Lab — K-means Clustering: Finding k
● k can be any number you like!
● WSSSE: Within Set Sum of Squared Errors
● Sum of the squared distances between each point and its nearest centroid
● val wssse = model.computeCost(features)
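To make the definition concrete, here is a small plain-Scala sketch of the same quantity: for every point, take the squared distance to its nearest centre, then add them all up. `WssseSketch` is a hypothetical helper for illustration only; in the lab itself the value comes from `model.computeCost`.

```scala
object WssseSketch {
  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the nearest centre.
  def wssse(points: Seq[Seq[Double]], centers: Seq[Seq[Double]]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum
}
```

With the points 0, 1, 10, 11 and centres 0.5 and 10.5, each point lies 0.5 from its centre, so the result is 4 × 0.25 = 1.0.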

Lab — K-means Clustering: Finding k
● Test on k = 1 to 5
● (1 to 5).map(k => KMeans.train(features, k, numIterations).computeCost(features))
● WSSSE normally decreases as k increases
● Look for the ‘elbow’
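One rough way to look for the elbow is to compare how much WSSSE drops each time k grows by one; the elbow sits where the drops become small. The sketch below assumes the costs are already collected into a local sequence ordered by k. `ElbowSketch` is a hypothetical helper, and the numbers in the usage note are made up for illustration, not results from the wine data.

```scala
object ElbowSketch {
  // Decrease in cost between each consecutive pair of k values.
  // costs(0) is the cost for the smallest k, costs(1) for the next, etc.
  def drops(costs: Seq[Double]): Seq[Double] =
    costs.sliding(2).collect { case Seq(a, b) => a - b }.toSeq
}
```

For the illustrative costs 100, 40, 30, 28 the drops are 60, 10, 2: the large first drop followed by small ones suggests the elbow is at the second value of k.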

Lab — K-means Clustering: Training Model
● val wssse = KMeans.train(features, numClusters, numIterations).computeCost(features)

Lab — K-means Clustering: k = 3
● val numClusters = 3
● val model = KMeans.train(features, numClusters, numIterations)

Lab — K-means Clustering: Obtaining Type Predictions
● val predictions = features.map(feature => model.predict(feature))

Lab — K-means Clustering: Comparing To Labels
● val counts = predictions.map (p => (p,1)).reduceByKey(_+_)
● counts.collect

Lab — Next Steps
● Looks good, right? Let’s look at what the labels for each point
really are.
● val features = wines.select("Type", "Alcohol", "Hue", "Proline").rdd.map { case Row(t: Double, a: Double, h: Double, p: Double) => (t, Vectors.dense(a, h, p)) }
● val predictions = features.map(feature => (feature._1, model.predict(feature._2)))
● val counts = predictions.map (p => (p,1)).reduceByKey(_+_)
● counts.collect
● A slightly different story!
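The (label, cluster) counts can be boiled down to a single number. One common choice is purity: for each cluster, count the points that carry its most frequent label, then divide the total by the number of points. The sketch below is plain Scala operating on the collected counts; `PuritySketch` is a hypothetical helper, and the counts in the usage note are invented to show the arithmetic, not taken from the wine data.

```scala
object PuritySketch {
  // counts: ((label, cluster), n) pairs, the shape returned by counts.collect
  def purity(counts: Seq[((Double, Int), Int)]): Double = {
    val total = counts.map(_._2).sum
    // For each cluster, take the size of its largest label group.
    val majority = counts
      .groupBy { case ((_, cluster), _) => cluster }
      .values
      .map(group => group.map(_._2).max)
      .sum
    majority.toDouble / total
  }
}
```

With the invented counts `Seq(((1.0, 0), 8), ((2.0, 0), 2), ((2.0, 1), 10))`, cluster 0 is mostly label 1 (8 of 10) and cluster 1 is entirely label 2, giving a purity of 18 / 20 = 0.9. A value of 1.0 would mean every cluster contains a single wine type.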

Lab — Next Steps
● k-means clustering - useful! But not perfect!
● Try again with more features in the vector and see if it
improves the clustering.
● Naive Bayes? Random Forests? Both are in MLlib, with similar interfaces!

Lab — Next Steps
● spark.apache.org

Lab — Questions
● ?
