Big Data made easy with a Spark
Open Source 101
Columbia, SC
April 18th, 2019
Jean Georges Perrin
Software since 1983
Big Data since 1984

@jgperrin • http://jgp.net [blog]
News…
๏ Director of Engineering for WeExperience
๏ Hiring a team of talented engineers to work with us
๏ Front end
๏ Mobile
๏ Back end & data
๏ AI
๏ Shoot a message to @jgperrin
Big data made easy with a Spark
Caution
Hands-on tutorial
Tons of content
Unknown crowd
Unknown setting
Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
Who art thou?
๏ Experience with Spark?
๏ Experience with Hadoop?
๏ Experience with Scala?
๏ Java?
๏ PHP guru?
๏ Front-end developer?
But most importantly…
๏ … who is not a developer?
๏ What is Big Data?
๏ What is Spark?
๏ What can I do with Spark?
๏ What is a Spark app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, seriously, what just happened?
๏ Let’s do AI!
๏ Going further
Agenda
Biiiiiiiig Data: 3, 4, or 5 Vs?
๏ volume
๏ variety
๏ velocity
๏ variability
๏ value
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
Data is considered big when it needs more than one computer to be processed.
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
An analytics operating system?
[Diagram: the classic stack — apps on top of an OS on top of hardware — compared with the distributed stack: apps on top of an analytics OS (Spark), which runs on a distributed OS spanning several machines, each with its own OS and hardware.]
Some use cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ Lumeris
๏ General compute
๏ Distributed data transfer/pipeline
๏ CERN
๏ Analysis of the scientific experiments in the LHC (the Large Hadron Collider)
๏ IBM
๏ Watson Data Studio
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ And much more…
What does a typical app look like?
๏ Connect to the cluster
๏ Load data
๏ Do something with the data
๏ Share the results
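In Java, that shape could look like the following minimal sketch (the class name is hypothetical, and the data file is borrowed from Lab #1; any real source and sink would do):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypicalAppSkeleton {
  public static void main(String[] args) {
    // 1. Connect to the cluster (here: an embedded local master)
    SparkSession spark = SparkSession.builder()
        .appName("Typical app skeleton")
        .master("local")
        .getOrCreate();

    // 2. Load data
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // 3. Do something with the data
    Dataset<Row> result = df.groupBy("authorId").count();

    // 4. Share the results (here: print them; could be a write to a datastore)
    result.show();

    spark.stop();
  }
}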
Convinced?
On y va! (Let's go!)
http://bit.ly/spark-clego
Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
Download some tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen or later
๏ http://bit.ly/eclipseo2
๏ Other nice-to-haves
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
Aren’t you glad we are
using Java?
Lab #1 - ingestion
Lab #1 - ingestion
๏ Goal

In a Big Data project, ingestion is the first operation.
You get the data “in.”
๏ Source code

https://github.com/jgperrin/
net.jgp.books.spark.ch01
Getting deeper
๏ Go to net.jgp.books.spark.ch01
๏ Open CsvToDataframeApp.java
๏ Right click, Run As, Java Application
+---+--------+--------------------+-----------+--------------------+
| id|authorId| title|releaseDate| link|
+---+--------+--------------------+-----------+--------------------+
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
+---+--------+--------------------+-----------+--------------------+
only showing top 5 rows
package net.jgp.books.sparkWithJava.ch01;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {

  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
/jgperrin/net.jgp.books.sparkWithJava.ch01
So what happened?
Let’s try to understand a little more
Apache Spark's components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph).
[Diagram: your application uses a unified API — Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, GraphX — and Spark spreads the work over a cluster of nodes (node 1, node 2, … node 8), each with its own OS and hardware.]
[Diagram: the same stack seen from the data side — your application manipulates a dataframe through the unified API; the dataframe is the common container shared by Spark SQL, Spark streaming, machine learning, and GraphX, whatever the nodes underneath.]
Lab #2 - a bit of analytics
But really just a bit
Lab #2 - a little bit of analytics
๏ Goal

From two datasets, one containing books and the other
authors, list the authors with the most books,
ordered by number of books
๏ Source code

https://github.com/jgperrin/net.jgp.labs.spark
If it were in a relational database
authors.csv — id: integer, name: string, link: string, wikipedia: string
books.csv — id: integer, authorId: integer, title: string, releaseDate: string, link: string
Basic analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
+---+-------------------+--------------------+-----+
| id| name| link|count|
+---+-------------------+--------------------+-----+
| 1| J. K. Rowling|http://amzn.to/2l...| 4|
| 12|William Shakespeare|http://amzn.to/2j...| 3|
| 4| Denis Diderot|http://amzn.to/2i...| 2|
| 6| Craig Walls|http://amzn.to/2A...| 2|
| 2|Jean Georges Perrin|http://amzn.to/2w...| 2|
| 3| Mark Twain|http://amzn.to/2v...| 2|
| 11| Alan Mycroft|http://amzn.to/2A...| 1|
| 10| Mario Fusco|http://amzn.to/2A...| 1|
…
+---+-------------------+--------------------+-----+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- link: string (nullable = true)
|-- count: long (nullable = false)
package net.jgp.labs.spark.l200_join.l030_count_books;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AuthorsAndBooksCountBooksApp {

  public static void main(String[] args) {
    AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("Authors and Books")
        .master("local").getOrCreate();

    String filename = "data/authors.csv";
    Dataset<Row> authorsDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);
/jgperrin/net.jgp.labs.spark
    filename = "data/books.csv";
    Dataset<Row> booksDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);

    Dataset<Row> libraryDf = authorsDf
        .join(
            booksDf,
            authorsDf.col("id").equalTo(booksDf.col("authorId")),
            "left")
        .withColumn("bookId", booksDf.col("id"))
        .drop(booksDf.col("id"))
        .groupBy(
            authorsDf.col("id"),
            authorsDf.col("name"),
            authorsDf.col("link"))
        .count();

    libraryDf = libraryDf
        .orderBy(libraryDf.col("count").desc());
    libraryDf.show();
    libraryDf.printSchema();
  }
}
/jgperrin/net.jgp.labs.spark
The art of delegating
[Diagram: your app lives in the driver, which talks to the cluster manager (master); each slave (worker) hosts an executor, and each executor runs tasks.]
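The labs all run with .master("local"), so everything happens in one JVM. Pointing the same code at a real cluster is mostly a matter of changing the master URL; a sketch (the host and port are placeholders):

SparkSession spark = SparkSession.builder()
    .appName("Authors and Books")
    .master("spark://cluster-manager-host:7077") // instead of "local"
    .getOrCreate();

In practice you would typically package the app and hand it to the cluster with spark-submit rather than run it from the IDE.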
Lab #3 - an even smaller bit of AI
But really just a bit
What's AI, anyway?
Popular beliefs
General AI
๏ Robot with human-like behavior
๏ HAL from 2001
๏ Isaac Asimov
๏ Potential ethical problems
Narrow AI (the current state of the art)
๏ Lots of mathematics
๏ Heavy calculations
๏ Algorithms
๏ Self-driving cars
"I am an expert in general AI"
In practice, ARTIFICIAL INTELLIGENCE is Machine Learning.
Machine learning
๏ Common algorithms
๏ Linear and logistic regressions
๏ Classification and regression trees
๏ K-nearest neighbors (KNN)
๏ Deep learning
๏ Subset of ML
๏ Artificial neural networks (ANNs)
๏ Very CPU intensive; benefits from GPUs
There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.
DATA Engineer vs. DATA Scientist
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA Engineer
๏ Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
๏ Match architecture with business needs.
๏ Develop processes for data modeling, mining, and pipelines.
๏ Improve data reliability and quality.
DATA Scientist
๏ Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
๏ Prepare data for predictive models.
๏ Explore data to find hidden gems and patterns.
๏ Tell stories to key stakeholders.
[Diagram: SQL sits at the intersection of the DATA Engineer and the DATA Scientist.]
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
All over again
As the old adage goes:
Garbage In,
Garbage Out
xkcd
Lab #3 - correcting and extrapolating data
Lab #3 - projecting data
๏ Goal

As a restaurant manager, I want to predict how
much revenue a party of 40 will bring
๏ Source code

https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
If everything were as simple…
[Chart: dinner revenue per number of guests]
…as a visual representation — with two anomalies standing out (anomaly #1 and anomaly #2).
I love it when a plan comes together
Load & Format
+-----+-----+
|guest|price|
+-----+-----+
| 1| 23.1|
| 2| 30.0|
…
+-----+-----+
only showing top 20 rows
----
1st DQ rule
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
| 1| 23.1| 23.1|
| 2| 30.0| 30.0|
…
| 25| 3.0| -1.0|
| 26| 10.0| -1.0|
…
+-----+-----+------------+
…
+-----+-----+-----+--------+
|guest|price|label|features|
+-----+-----+-----+--------+
| 1| 23.1| 23.1| [1.0]|
| 2| 30.0| 30.0| [2.0]|
…
+-----+-----+-----+--------+
only showing top 20 rows
…
RMSE: 2.802192495300457
r2: 0.9965340953376102
Intersection: 20.979190460591575
Regression parameter: 1.0
Tol: 1.0E-6
Prediction for 40.0 guests is 218.00351106373822
Using existing data quality rules

package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;

public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {

  public Double call(Double price) throws Exception {
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}

/jgperrin/net.jgp.labs.sparkdq4ml

If the price is OK, the UDF returns the price;
if the price is KO (invalid), it returns -1.
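The service behind the UDF is not shown in the deck; a minimal sketch of what it could look like (the threshold value is purely hypothetical, chosen only to match the sample output):

package net.jgp.labs.sparkdq4ml.dq.service;

public class MinimumPriceDataQualityService {
  // Hypothetical threshold: any price below this is considered an anomaly
  private static final double MINIMUM_PRICE = 20.0;

  public static Double checkMinimumPrice(Double price) {
    if (price == null || price < MINIMUM_PRICE) {
      return -1.0; // flag the anomaly; the caller filters on -1
    }
    return price;
  }
}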
Telling Spark to use my DQ rules

SparkSession spark = SparkSession.builder()
    .appName("DQ4ML").master("local").getOrCreate();

spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);

spark.udf().register(
    "priceCorrelationRule",
    new PriceCorrelationDataQualityUdf(),
    DataTypes.DoubleType);

/jgperrin/net.jgp.labs.sparkdq4ml
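The price correlation UDF is registered above but not listed in the deck. Purely as an illustration (the real rule's signature and thresholds may differ), it could look at guests and price together and flag a price that is implausible for the party size:

package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF2;

public class PriceCorrelationDataQualityUdf
    implements UDF2<Integer, Double, Double> {

  public Double call(Integer guest, Double price) throws Exception {
    // Hypothetical band: flag a per-guest price outside a plausible range
    double perGuest = price / guest;
    if (perGuest < 4.0 || perGuest > 16.0) {
      return -1.0; // out of the plausible band: flag as anomaly
    }
    return price;
  }
}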
Loading my dataset

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

Using CSV, but could be Hive, JDBC, you name it…

/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|   1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
Raw data, contains the anomalies
Apply the rules

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
Anomalies are clearly identified by -1, so they
can be easily filtered
Filtering out anomalies

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Useable data
Format the data for ML
๏ Convert/adapt the dataset to Features and Label
๏ Required for Linear Regression in MLlib
๏ Needs a column called label of type double
๏ Needs a column called features of type VectorUDT
Format the data for ML

spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());

df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...

double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);

/jgperrin/net.jgp.labs.sparkdq4ml
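The VectorBuilder UDF registered above is not listed in the deck; a minimal sketch of what it could be (the package name and input type are assumptions based on the registration and the guest column):

package net.jgp.labs.sparkdq4ml.ml.udf;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilder implements UDF1<Integer, Vector> {

  public Vector call(Integer guest) throws Exception {
    // Wraps the single numeric feature into the vector MLlib expects
    return Vectors.dense(guest.doubleValue());
  }
}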
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
Prediction for 40 guests
(the complex ML code)

// Define the algorithm and its (hyper)parameters
LinearRegression lr = new LinearRegression()
    .setMaxIter(40)
    .setRegParam(1)
    .setElasticNetParam(1);

// Create a model from our data
LinearRegressionModel model = lr.fit(df);

// Apply the model to a new dataset: predict
Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);

/jgperrin/net.jgp.labs.sparkdq4ml
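The RMSE, r2, and intercept printed earlier can be read from the model's training summary; a short sketch continuing the code above (assuming the same model variable; the deck's output labels the intercept "Intersection"):

import org.apache.spark.ml.regression.LinearRegressionTrainingSummary;

LinearRegressionTrainingSummary summary = model.summary();
System.out.println("RMSE: " + summary.rootMeanSquaredError());
System.out.println("r2: " + summary.r2());
System.out.println("Intercept: " + model.intercept());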
It’s all about the base model
[Diagram: Step 1, the learning phase — a trainer fits a model on dataset #1. Steps 2..n, the predictive phase — the same model is applied to dataset #2 (and any later dataset) to produce predicted data.]
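To keep the same model across the learning and predictive phases, it can be persisted and reloaded; a minimal sketch reusing the classes from the code above (the path is a placeholder):

// Step 1: after training, save the model
model.write().overwrite().save("/tmp/dinner-revenue-model");

// Steps 2..n: reload the same model and predict on new data
LinearRegressionModel reloaded =
    LinearRegressionModel.load("/tmp/dinner-revenue-model");
double prediction = reloaded.predict(Vectors.dense(40.0));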
Conclusion
A (Big) Data Scenario
Raw Data → Ingestion → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish
Key takeaways
๏ Big Data is easier than one might think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
๏ Spark is easily extensible
Going further
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ fb.com/TriangleSpark
๏ Start a Spark meetup in Columbia, SC?
Going further
Spark in action (Second edition, MEAP)
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
Discount codes: sprkans-681D, sprkans-7538, ctwopen10119 — 40% off, plus one or two free books to give away.
Thanks
@jgperrin
Backup
Spark in Action
Second edition, MEAP
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
Credits
Photos by Pexels
IBM PC XT by Ruben de Rijcke - http://dendmedia.com/vintage/ - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3610862
Illustrations © Jean Georges Perrin
No more slides
You’re on your own!
