Big Data made easy with a Spark
Open Source 101
Columbia, SC
April 18th, 2019
Jean Georges Perrin
Software since 1983
Big Data since 1984

@jgperrin • http://jgp.net [blog]
News…
๏ Director of Engineering for WeExperience
๏ Hiring a team of talented engineers to work with us
๏ Front end
๏ Mobile
๏ Back end & data
๏ AI
๏ Shoot a message to @jgperrin
Big data made easy with a Spark
Caution
Hands-on tutorial
Tons of content
Unknown crowd
Unknown setting
Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
Who art thou?
๏ Experience with Spark?
๏ Experience with Hadoop?
๏ Experience with Scala?
๏ Java?
๏ PHP guru?
๏ Front-end developer?
But most importantly…
๏ … who is not a developer?
๏ What is Big Data?
๏ What is Spark?
๏ What can I do with Spark?
๏ What is a Spark app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, seriously, what just happened?
๏ Let’s do AI!
๏ Going further
Agenda
Biiiiiiiig Data: 3, 4, or 5 Vs?
๏ volume
๏ variety
๏ velocity
๏ variability
๏ value
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
Data is considered big when it needs more than one computer to be processed.
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
An analytics operating system?
[Diagram: the classic stack — apps on top of an OS on top of hardware — compared with the distributed stack: apps on top of an analytics OS (Spark), which runs on a distributed OS spanning several machines, each with its own OS and hardware.]
Some use cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ Lumeris
๏ General compute
๏ Distributed data transfer/pipeline
๏ CERN
๏ Analysis of the scientific experiments in the LHC (the Large Hadron Collider)
๏ IBM
๏ Watson Data Studio
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ And much more…
What does a typical app look like?
๏ Connect to the cluster
๏ Load data
๏ Do something with the data
๏ Share the results
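In Java, that shape could look like the following minimal sketch (the class name is hypothetical, and the data file is borrowed from Lab #1; any real source and sink would do):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypicalAppSkeleton {
  public static void main(String[] args) {
    // 1. Connect to the cluster (here: an embedded local master)
    SparkSession spark = SparkSession.builder()
        .appName("Typical app skeleton")
        .master("local")
        .getOrCreate();

    // 2. Load data
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // 3. Do something with the data
    Dataset<Row> result = df.groupBy("authorId").count();

    // 4. Share the results (here: print them; could be a write to a datastore)
    result.show();

    spark.stop();
  }
}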
Convinced?
On y va! (Let's go!)
http://bit.ly/spark-clego
Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
Download some tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen or later
๏ http://bit.ly/eclipseo2
๏ Other nice-to-haves
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
Aren’t you glad we are
using Java?
Lab #1 - ingestion
Lab #1 - ingestion
๏ Goal

In a Big Data project, ingestion is the first operation.
You get the data “in.”
๏ Source code

https://github.com/jgperrin/
net.jgp.books.spark.ch01
Getting deeper
๏ Go to net.jgp.books.spark.ch01
๏ Open CsvToDataframeApp.java
๏ Right click, Run As, Java Application
+---+--------+--------------------+-----------+--------------------+
| id|authorId| title|releaseDate| link|
+---+--------+--------------------+-----------+--------------------+
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
+---+--------+--------------------+-----------+--------------------+
only showing top 5 rows
package net.jgp.books.sparkWithJava.ch01;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {

  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
/jgperrin/net.jgp.books.sparkWithJava.ch01
So what happened?
Let’s try to understand a little more
Apache Spark's components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph).
[Diagram: your application uses a unified API — Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, GraphX — and Spark spreads the work over a cluster of nodes (node 1, node 2, … node 8), each with its own OS and hardware.]
[Diagram: the same stack seen from the data side — your application manipulates a dataframe through the unified API; the dataframe is the common container shared by Spark SQL, Spark streaming, machine learning, and GraphX, whatever the nodes underneath.]
Lab #2 - a bit of analytics
But really just a bit
Lab #2 - a little bit of analytics
๏ Goal

From two datasets, one containing books and the other
authors, list the authors with the most books,
ordered by number of books
๏ Source code

https://github.com/jgperrin/net.jgp.labs.spark
If it were in a relational database
authors.csv — id: integer, name: string, link: string, wikipedia: string
books.csv — id: integer, authorId: integer, title: string, releaseDate: string, link: string
Basic analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
+---+-------------------+--------------------+-----+
| id| name| link|count|
+---+-------------------+--------------------+-----+
| 1| J. K. Rowling|http://amzn.to/2l...| 4|
| 12|William Shakespeare|http://amzn.to/2j...| 3|
| 4| Denis Diderot|http://amzn.to/2i...| 2|
| 6| Craig Walls|http://amzn.to/2A...| 2|
| 2|Jean Georges Perrin|http://amzn.to/2w...| 2|
| 3| Mark Twain|http://amzn.to/2v...| 2|
| 11| Alan Mycroft|http://amzn.to/2A...| 1|
| 10| Mario Fusco|http://amzn.to/2A...| 1|
…
+---+-------------------+--------------------+-----+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- link: string (nullable = true)
|-- count: long (nullable = false)
package net.jgp.labs.spark.l200_join.l030_count_books;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AuthorsAndBooksCountBooksApp {

  public static void main(String[] args) {
    AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("Authors and Books")
        .master("local").getOrCreate();

    String filename = "data/authors.csv";
    Dataset<Row> authorsDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);
/jgperrin/net.jgp.labs.spark
    filename = "data/books.csv";
    Dataset<Row> booksDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);

    Dataset<Row> libraryDf = authorsDf
        .join(
            booksDf,
            authorsDf.col("id").equalTo(booksDf.col("authorId")),
            "left")
        .withColumn("bookId", booksDf.col("id"))
        .drop(booksDf.col("id"))
        .groupBy(
            authorsDf.col("id"),
            authorsDf.col("name"),
            authorsDf.col("link"))
        .count();

    libraryDf = libraryDf
        .orderBy(libraryDf.col("count").desc());
    libraryDf.show();
    libraryDf.printSchema();
  }
}
/jgperrin/net.jgp.labs.spark
The art of delegating
[Diagram: your app lives in the driver, which talks to the cluster manager (master); each slave (worker) hosts an executor, and each executor runs tasks.]
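The labs all run with .master("local"), so everything happens in one JVM. Pointing the same code at a real cluster is mostly a matter of changing the master URL; a sketch (the host and port are placeholders):

SparkSession spark = SparkSession.builder()
    .appName("Authors and Books")
    .master("spark://cluster-manager-host:7077") // instead of "local"
    .getOrCreate();

In practice you would typically package the app and hand it to the cluster with spark-submit rather than run it from the IDE.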
Lab #3 - an even smaller bit of AI
But really just a bit
What's AI, anyway?
Popular beliefs
General AI
๏ Robot with human-like behavior
๏ HAL from 2001
๏ Isaac Asimov
๏ Potential ethical problems
Narrow AI (the current state of the art)
๏ Lots of mathematics
๏ Heavy calculations
๏ Algorithms
๏ Self-driving cars
"I am an expert in general AI"
In practice, ARTIFICIAL INTELLIGENCE is Machine Learning.
Machine learning
๏ Common algorithms
๏ Linear and logistic regressions
๏ Classification and regression trees
๏ K-nearest neighbors (KNN)
๏ Deep learning
๏ Subset of ML
๏ Artificial neural networks (ANNs)
๏ Very CPU intensive; benefits from GPUs
There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.
DATA Engineer vs. DATA Scientist
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA Engineer
๏ Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
๏ Match architecture with business needs.
๏ Develop processes for data modeling, mining, and pipelines.
๏ Improve data reliability and quality.
DATA Scientist
๏ Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
๏ Prepare data for predictive models.
๏ Explore data to find hidden gems and patterns.
๏ Tell stories to key stakeholders.
[Diagram: SQL sits at the intersection of the DATA Engineer and the DATA Scientist.]
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
All over again
As the old adage goes:
Garbage In,
Garbage Out
xkcd
Lab #3 - correcting and extrapolating data
Lab #3 - projecting data
๏ Goal

As a restaurant manager, I want to predict how
much revenue a party of 40 will bring
๏ Source code

https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
If everything were as simple…
[Chart: dinner revenue per number of guests]
…as a visual representation — with two anomalies standing out (anomaly #1 and anomaly #2).
I love it when a plan comes together
Load & Format
+-----+-----+
|guest|price|
+-----+-----+
| 1| 23.1|
| 2| 30.0|
…
+-----+-----+
only showing top 20 rows
----
1st DQ rule
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
| 1| 23.1| 23.1|
| 2| 30.0| 30.0|
…
| 25| 3.0| -1.0|
| 26| 10.0| -1.0|
…
+-----+-----+------------+
…
+-----+-----+-----+--------+
|guest|price|label|features|
+-----+-----+-----+--------+
| 1| 23.1| 23.1| [1.0]|
| 2| 30.0| 30.0| [2.0]|
…
+-----+-----+-----+--------+
only showing top 20 rows
…
RMSE: 2.802192495300457
r2: 0.9965340953376102
Intersection: 20.979190460591575
Regression parameter: 1.0
Tol: 1.0E-6
Prediction for 40.0 guests is 218.00351106373822
Using existing data quality rules

package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;

public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {

  public Double call(Double price) throws Exception {
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}

/jgperrin/net.jgp.labs.sparkdq4ml

If the price is OK, the UDF returns the price;
if the price is KO (invalid), it returns -1.
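The service behind the UDF is not shown in the deck; a minimal sketch of what it could look like (the threshold value is purely hypothetical, chosen only to match the sample output):

package net.jgp.labs.sparkdq4ml.dq.service;

public class MinimumPriceDataQualityService {
  // Hypothetical threshold: any price below this is considered an anomaly
  private static final double MINIMUM_PRICE = 20.0;

  public static Double checkMinimumPrice(Double price) {
    if (price == null || price < MINIMUM_PRICE) {
      return -1.0; // flag the anomaly; the caller filters on -1
    }
    return price;
  }
}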
Telling Spark to use my DQ rules

SparkSession spark = SparkSession.builder()
    .appName("DQ4ML").master("local").getOrCreate();

spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);

spark.udf().register(
    "priceCorrelationRule",
    new PriceCorrelationDataQualityUdf(),
    DataTypes.DoubleType);

/jgperrin/net.jgp.labs.sparkdq4ml
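The price correlation UDF is registered above but not listed in the deck. Purely as an illustration (the real rule's signature and thresholds may differ), it could look at guests and price together and flag a price that is implausible for the party size:

package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF2;

public class PriceCorrelationDataQualityUdf
    implements UDF2<Integer, Double, Double> {

  public Double call(Integer guest, Double price) throws Exception {
    // Hypothetical band: flag a per-guest price outside a plausible range
    double perGuest = price / guest;
    if (perGuest < 4.0 || perGuest > 16.0) {
      return -1.0; // out of the plausible band: flag as anomaly
    }
    return price;
  }
}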
Loading my dataset

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

Using CSV, but could be Hive, JDBC, you name it…

/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|   1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
Raw data, contains the anomalies
Apply the rules

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
Anomalies are clearly identified by -1, so they
can be easily filtered
Filtering out anomalies

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Useable data
Format the data for ML
๏ Convert/adapt the dataset to Features and Label
๏ Required for Linear Regression in MLlib
๏ Needs a column called label of type double
๏ Needs a column called features of type VectorUDT
Format the data for ML

spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());

df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...

double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);

/jgperrin/net.jgp.labs.sparkdq4ml
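The VectorBuilder UDF registered above is not listed in the deck; a minimal sketch of what it could be (the package name and input type are assumptions based on the registration and the guest column):

package net.jgp.labs.sparkdq4ml.ml.udf;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilder implements UDF1<Integer, Vector> {

  public Vector call(Integer guest) throws Exception {
    // Wraps the single numeric feature into the vector MLlib expects
    return Vectors.dense(guest.doubleValue());
  }
}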
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
Prediction for 40 guests
(the complex ML code)

// Define the algorithm and its (hyper)parameters
LinearRegression lr = new LinearRegression()
    .setMaxIter(40)
    .setRegParam(1)
    .setElasticNetParam(1);

// Create a model from our data
LinearRegressionModel model = lr.fit(df);

// Apply the model to a new dataset: predict
Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);

/jgperrin/net.jgp.labs.sparkdq4ml
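The RMSE, r2, and intercept printed earlier can be read from the model's training summary; a short sketch continuing the code above (assuming the same model variable; the deck's output labels the intercept "Intersection"):

import org.apache.spark.ml.regression.LinearRegressionTrainingSummary;

LinearRegressionTrainingSummary summary = model.summary();
System.out.println("RMSE: " + summary.rootMeanSquaredError());
System.out.println("r2: " + summary.r2());
System.out.println("Intercept: " + model.intercept());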
It’s all about the base model
[Diagram: Step 1, the learning phase — a trainer fits a model on dataset #1. Steps 2..n, the predictive phase — the same model is applied to dataset #2 (and any later dataset) to produce predicted data.]
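To keep the same model across the learning and predictive phases, it can be persisted and reloaded; a minimal sketch reusing the classes from the code above (the path is a placeholder):

// Step 1: after training, save the model
model.write().overwrite().save("/tmp/dinner-revenue-model");

// Steps 2..n: reload the same model and predict on new data
LinearRegressionModel reloaded =
    LinearRegressionModel.load("/tmp/dinner-revenue-model");
double prediction = reloaded.predict(Vectors.dense(40.0));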
Conclusion
A (Big) Data Scenario
Raw Data → Ingestion → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish
Key takeaways
๏ Big Data is easier than one might think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
๏ Spark is easily extensible
Going further
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ fb.com/TriangleSpark
๏ Start a Spark meetup in Columbia, SC?
Going further
Spark in action (Second edition, MEAP)
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
Discount codes: sprkans-681D, sprkans-7538, ctwopen10119 — 40% off, plus one or two free books to give away.
Thanks
@jgperrin
Backup
Spark in Action
Second edition, MEAP
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
Credits
Photos by Pexels
IBM PC XT by Ruben de Rijcke - http://dendmedia.com/vintage/ - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3610862
Illustrations © Jean Georges Perrin
No more slides
You’re on your own!
