Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
Lab Overview
● ‘Hello world’ RDD example
● Importing a dataset
● Dataframe operations and visualizations
● Using MLlib on the dataset
Lab — Hello World
● ./run_spark
Lab — Hello World
● val text = sc.parallelize(Seq("your text here"))
● val words = text.flatMap(line => line.split(" "))
● words.collect
Lab — Hello World
● val taggedWords = words.map(word => (word,1))
● val counts = taggedWords.reduceByKey(_ + _)
● counts.collect()
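To see what each step of the word count yields, the same pipeline can be mimicked on ordinary Scala collections, with no Spark involved. This is only a sketch: the object name `LocalWordCount` and the sample sentence are made up for illustration, and `groupBy` followed by a sum stands in for `reduceByKey(_ + _)`.

```scala
// Local-collections analogue of the RDD word count (no Spark required).
// LocalWordCount is a hypothetical name used only for this illustration.
object LocalWordCount {
  // Split every line on single spaces, like text.flatMap(line => line.split(" ")).
  def words(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => line.split(" "))

  // Tag each word with 1, then sum the tags per word,
  // which is what map(word => (word, 1)) plus reduceByKey(_ + _) computes.
  def counts(lines: Seq[String]): Map[String, Int] =
    words(lines)
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, tagged) => (word, tagged.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    counts(Seq("to be or not to be")).foreach(println)
}
```

In Spark the grouping and summing happen per partition and are then merged across the cluster, but the resulting (word, count) pairs are the same.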
Lab — Dataset
● https://archive.ics.uci.edu/ml/datasets/Wine
● Information on 3 different types of wine from Genoa
● 178 entries (small!)
Lab — Loading The Wine Dataset
● val wines = sqlContext.read.json("wine.json")
● wines.registerTempTable("wines")
Lab — Showing the generated Schema
● wines.printSchema
Lab — Dataframe Operations
● wines.first
Lab — Dataframe Operations
● sqlContext.sql("SELECT Type, count(Type) AS count FROM wines GROUP BY Type").show
Lab — Dataframe Operations
● Experiment with %sql on the dataset (SELECT, COUNT, etc)
Lab — K-means Clustering
● K-Means clustering is an unsupervised algorithm which splits a
dataset into a number of clusters (k) based on a notion of
similarity between points. It is often applied to real-world data
to obtain a picture of structure hidden in large datasets, for
example, identifying location clusters or breaking down sales
into distinct purchasing groups.
Lab — K-means Clustering
Step 1: k initial "means" (in this case k=3) are randomly generated within the data domain (shown in colour).
Lab — K-means Clustering
Step 2: k (in this case, 3) clusters are created by assigning each data point to the closest mean.
Lab — K-means Clustering
Step 3: The centroid of each of these clusters is found, and these centroids are used as the new means. New clusters are formed by assigning each data point to the closest of the new means, as in Step 2. The process is repeated until the means converge (or until we hit our iteration limit).
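The assign-and-update loop described on these slides can be sketched in plain Scala. This is a minimal illustration, not MLlib's implementation: `TinyKMeans` and its method names are hypothetical, and the initial means are passed in by the caller rather than generated randomly, so that the result is reproducible.

```scala
// A small, non-distributed sketch of the k-means loop (hypothetical names).
object TinyKMeans {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the mean closest to p.
  def closest(means: Seq[Point], p: Point): Int =
    means.indices.minBy(i => sqDist(means(i), p))

  // Update step: component-wise average of a non-empty group of points.
  def centroid(points: Seq[Point]): Point =
    points.transpose.map(_.sum / points.size).toVector

  // Repeat assignment and update until the means stop moving
  // or maxIter rounds have been run.
  @annotation.tailrec
  def fit(points: Seq[Point], means: Seq[Point], maxIter: Int): Seq[Point] =
    if (maxIter <= 0) means
    else {
      val clusters = points.groupBy(p => closest(means, p))
      // A mean that attracted no points is left where it is (an assumption of this sketch).
      val updated = means.indices.map(i => clusters.get(i).map(centroid).getOrElse(means(i)))
      if (updated == means) updated else fit(points, updated, maxIter - 1)
    }
}
```

For example, `TinyKMeans.fit(Seq(Vector(1.0), Vector(2.0), Vector(10.0), Vector(11.0)), Seq(Vector(0.0), Vector(5.0)), 20)` moves the two starting means to the centres of the two obvious groups and then stops, because a further round would not change them.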
Lab — K-means Clustering: Imports
● import org.apache.spark.mllib.clustering.KMeans
● import org.apache.spark.mllib.linalg.Vectors
● import org.apache.spark.sql._
Lab — K-means Clustering: Features
● val featureCols = wines.select("Alcohol", "Hue", "Proline")
● val features = featureCols.rdd.map { case Row(a: Double, h: Double, p: Double) => Vectors.dense(a,h,p) }
● features.cache
Lab — K-means Clustering: Training Model
● val numClusters = 2
● val numIterations = 20
● val model = KMeans.train(features, numClusters, numIterations)
Lab — K-means Clustering: Finding k
● k can be any number you like!
● WSSSE: Within Set Sum of Squared Errors
● Sum of the squared distances between each point and its closest centroid
● val wssse = model.computeCost(features)
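As a rough illustration of what that cost measures, WSSSE can be computed by hand for a handful of points. The sketch below uses plain Scala collections; the `Wssse` object and the sample points are made up and are not part of MLlib.

```scala
// Hand-rolled WSSSE on local data (hypothetical helper, not the MLlib API).
object Wssse {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // For every point take the squared distance to its nearest centroid, then add them up.
  def cost(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => centroids.map(c => sqDist(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    val points = Seq(Vector(1.0), Vector(2.0), Vector(10.0), Vector(11.0))
    println(cost(points, Seq(Vector(6.0))))                 // one centroid in the middle
    println(cost(points, Seq(Vector(1.5), Vector(10.5))))   // one centroid per group
  }
}
```

With a single centroid every point is far from it and the cost is large; with one centroid per group the cost drops sharply, which is the effect the next slide exploits when scanning over k.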
Lab — K-means Clustering: Finding k
● Test on k = 1 to 5
● (1 to 5 by 1).map(k => KMeans.train(features, k, numIterations).computeCost(features))
● WSSSE normally decreases as k increases
● Look for the ‘elbow’: the k beyond which adding clusters reduces WSSSE only slightly
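One informal way to spot the elbow is to look at how much each extra cluster reduces the cost. The sketch below is only a heuristic aid under stated assumptions: `ElbowScan` is a hypothetical helper, and the cost values in `main` are invented for illustration rather than taken from the wine data.

```scala
// Relative cost reduction per extra cluster (hypothetical helper).
object ElbowScan {
  // Element i compares the cost for k = i + 1 with the cost for k = i + 2
  // and returns the fraction by which the cost fell.
  def reductions(costs: Seq[Double]): Seq[Double] =
    costs.zip(costs.drop(1)).map { case (prev, next) => (prev - next) / prev }

  def main(args: Array[String]): Unit = {
    // Made-up costs for k = 1 to 4, for illustration only (not wine results).
    val costs = Seq(100.0, 40.0, 30.0, 27.0)
    reductions(costs).zipWithIndex.foreach { case (r, i) =>
      println(s"k=${i + 1} -> k=${i + 2}: cost falls by ${(r * 100).round}%")
    }
  }
}
```

A large reduction followed by much smaller ones suggests that the k just after the large drop is a reasonable choice. The costs returned by the `(1 to 5 by 1).map(...)` scan could be passed to `reductions` in the same way.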
Lab — K-means Clustering: Training Model
● val numClusters = 1
● val numIterations = 20
● val wssse = KMeans.train(features, numClusters, numIterations).computeCost(features)
Lab — K-means Clustering: k = 3
● val numClusters = 3
● val numIterations = 10
● val model = KMeans.train(features, numClusters, numIterations)
Lab — K-means Clustering: Obtaining Type Predictions
● val predictions = features.map(feature => model.predict(feature))
Lab — K-means Clustering: Comparing To Labels
● val counts = predictions.map (p => (p,1)).reduceByKey(_+_)
● counts.collect
Lab — Next Steps
● Looks good, right? Let’s look at what the labels for each point
really are.
● val labeledCols = wines.select("Type", "Alcohol", "Hue", "Proline")
● val features = labeledCols.rdd.map { case Row(t: Double, a: Double, h: Double, p: Double) => (t, Vectors.dense(a,h,p)) }
● val predictions = features.map(feature => (feature._1, model.predict(feature._2)))
● val counts = predictions.map (p => (p,1)).reduceByKey(_+_)
● counts.collect
● A slightly different story!
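The (label, cluster) counts are easier to read once they are summarised. The sketch below tallies such pairs on local data and computes purity, the fraction of points that carry the most common label of their cluster. Purity is an addition here and does not appear in the slides; `CrossTab` is a hypothetical helper, the sample pairs are made up, and labels are taken as Int for simplicity, whereas the lab matches Type as a Double.

```scala
// Summarising (label, cluster) pairs on local data (hypothetical helper).
object CrossTab {
  // Count how often each (label, cluster) pair occurs,
  // which is what map(p => (p, 1)).reduceByKey(_ + _) computes.
  def table(pairs: Seq[(Int, Int)]): Map[(Int, Int), Int] =
    pairs.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }

  // Purity: for each cluster take the size of its most common label,
  // add these up, and divide by the total number of points.
  def purity(pairs: Seq[(Int, Int)]): Double = {
    val majoritySizes = table(pairs)
      .groupBy { case ((_, cluster), _) => cluster }
      .values
      .map(_.values.max)
    majoritySizes.sum.toDouble / pairs.size
  }
}
```

A purity of 1.0 would mean every cluster contains a single wine type. Pairs collected from `predictions` (for example after converting the label with `toInt`) could be fed to `CrossTab.purity` to put a number on how "different" the story is.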
Lab — Next Steps
● k-means clustering - useful! But not perfect!
● Try again with more features in the vector and see if it
improves the clustering.
● Naive Bayes? Random Forests? Both are in MLlib, with similar interfaces!
Lab — Next Steps
● spark.apache.org
Lab — Questions
● ?

Intro to Apache Spark - Lab