Leveraging the Spark-HPCC Ecosystem

2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
James McMullan
Sr Software Engineer
LexisNexis Risk Solutions
Leveraging the Spark-HPCC Systems Ecosystem

Overview
• Spark-HPCC Plugin & Connector
• Basics of reading / writing to / from HPCC Systems
• Brief introduction to Apache Zeppelin
• Create a random forest model in Spark
• Compare to Kaggle competition leaderboard
• Future of Spark-HPCC Systems Ecosystem
• Closing thoughts

Spark-HPCC Systems Connector

Spark-HPCC Systems – Overview
• Spark-HPCC Systems Connector
  • Spark library
  • Allows reading and writing to HPCC Systems
  • Can be installed on any Spark cluster
• Spark Plugin – Managed Spark Cluster
  • Requires HPCC Systems 7.0+
  • Spark cluster that mirrors Thor cluster
  • Configured through Config Manager
  • Installs Spark-HPCC Systems connector

Spark-HPCC Systems Connector - Progress
• Added support for remote writing
• HPCC Systems 7.2+
• Improved performance
• Scala, Python and R
• Increased reliability
• Lots of testing and bug fixes
• Added support for DataSource API v1
• Unified Read / Write interface

Spark-HPCC Systems Connector – Reading
PySpark Read Example

    clusterURL = "http://192.168.56.101:8010"
    fileName = "example::dataset"

    # Read dataset from HPCC Systems
    df = spark.read.load(format="hpcc",
                         host=clusterURL,
                         password="",
                         username="",
                         limitPerFilePart=100,
                         projectList="field1, field2",
                         fileAccessTimeout=240,
                         path=fileName)

SparkR Read Example

    clusterURL <- "http://192.168.56.101:8010"
    fileName <- "example::dataset"

    # Read dataset from HPCC Systems
    df <- read.df(source = "hpcc",
                  host = clusterURL,
                  password = "",
                  username = "",
                  limitPerFilePart = 100,
                  projectList = "field1, field2",
                  fileAccessTimeout = 240,
                  path = fileName)

Spark-HPCC Systems Connector – Writing
PySpark Write Example

    clusterURL = "http://192.168.56.101:8010"
    fileName = "example::dataset"

    # Write dataset to HPCC Systems
    df.write.save(format="hpcc",
                  mode="overwrite",
                  host=clusterURL,
                  password="",
                  username="",
                  cluster="mythor",
                  path=fileName)

SparkR Write Example

    clusterURL <- "http://192.168.56.101:8010"
    fileName <- "example::dataset"

    # Write dataset to HPCC Systems
    write.df(df, source = "hpcc",
             host = clusterURL,
             cluster = "mythor",
             path = fileName,
             mode = "overwrite",
             password = "",
             username = "",
             fileAccessTimeout = 240)
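
The read and write calls take the same connection options, so those options can be collected once and reused. The following is a minimal sketch, not part of the connector: `hpcc_options` is a hypothetical helper that only builds a dict, and the option names (`host`, `path`, `username`, `password`, `cluster`, `fileAccessTimeout`) are the ones shown on the slides. The Spark calls appear only as comments because they assume an active SparkSession with the connector installed.

```python
def hpcc_options(host, path, username="", password="", **extra):
    """Build the keyword options shared by HPCC Systems reads and writes.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    options = {
        "host": host,
        "path": path,
        "username": username,
        "password": password,
    }
    # Call-specific options, e.g. cluster="mythor" for a write.
    options.update(extra)
    return options


read_opts = hpcc_options("http://192.168.56.101:8010", "example::dataset",
                         fileAccessTimeout=240)
write_opts = hpcc_options("http://192.168.56.101:8010", "example::dataset",
                          cluster="mythor")

# With an active SparkSession named `spark` (assumed), the dicts unpack
# into the same calls used on the slides:
#   df = spark.read.load(format="hpcc", **read_opts)
#   df.write.save(format="hpcc", mode="overwrite", **write_opts)
```

Keeping the options in one place means the cluster URL and credentials are stated once rather than repeated in every call.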

Apache Zeppelin - Overview
• Multi-user Notebook Environment
• Front end for Spark
• Collaborative
• Easy to use
• Handles resource management
• Handles job queuing and resource allocation
• We do not support or package Zeppelin

Apache Zeppelin – Features
• Multi-user environment by default
• Version Control
• Interpreters are bound at a Paragraph level
• Allows multiple languages in a single notebook
• Built-in visualization tools
• Ability to move data between languages
• Credential management

Spark ML Model – Brief Intro to Random Forests
• Random Forests: an ensemble of decision trees
• Averaging the output of multiple decision trees gives a better prediction
• Random Forests require data to be numeric
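
The averaging idea can be sketched in plain Python. This is a toy illustration, not Spark ML's implementation: each "tree" is a one-split regression stump trained on a bootstrap sample, the forest prediction is the mean of the stump predictions, and the data is made up for the example. The names `fit_stump`, `fit_forest` and `predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single numeric feature."""
    best = None
    # Try every distinct value (except the smallest) as a split threshold.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x value: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Forest prediction: the average of the individual tree outputs."""
    return mean(tree(x) for tree in trees)


# Made-up data: the target steps from about 10 to about 20 at x = 10.
xs = list(range(20))
ys = [(10 if x < 10 else 20) + x % 3 for x in xs]

forest = fit_forest(xs, ys, n_trees=25, rng=random.Random(0))
low = predict(forest, 2)
high = predict(forest, 17)
```

Every input here is already numeric, which matches the last bullet: categorical columns would first need to be mapped to numbers before a forest could split on them.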

Spark ML Model – Bulldozers R US
• Open source bulldozer auction dataset from Kaggle
• www.kaggle.com/c/bluebook-for-bulldozers
• Create a Random Forest Model to predict auction price
• Compare our model against the Kaggle leaderboard
• Score is calculated by RMSLE (Root Mean Squared Logarithmic Error)
• RMSLE measures relative (percentage-based) error rather than absolute error
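
As a rough sketch of the metric, assuming the common log(1 + x) form of RMSLE, sqrt(mean((log(1 + predicted) - log(1 + actual))^2)): the function name `rmsle` and the sample prices below are made up for illustration.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using the log(1 + x) form."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Over-predicting by 10% costs about the same on a cheap machine
# as on an expensive one, which is what makes the error relative.
cheap = rmsle([1_000], [1_100])
expensive = rmsle([100_000], [110_000])
```

Because the error is taken on logarithms, the score depends on the ratio between predicted and actual price, so a fixed dollar miss matters more on low-priced items than on high-priced ones.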

Spark ML Model – Results
• Our RMSLE ~ 0.26
• Around 50th out of 450 participants
• Not bad for little to no feature engineering
• RMSLE of ~0.22 is possible with Random Forests
• Hyperparameter tuning
• Feature engineering
• Deep Learning can do better than ~0.22
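
Hyperparameter tuning can be as simple as trying every combination in a small grid and keeping the one with the lowest validation error. The sketch below is illustrative only: `grid_search` is a hypothetical helper, and `toy_score` is a stand-in for what would really be "train a model with these settings and return its validation RMSLE". `numTrees` and `maxDepth` are parameter names of Spark ML's random forest estimators; the grid values are arbitrary.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) with the lowest score in the grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # lower is better, e.g. validation RMSLE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in scoring function; NOT a real model evaluation."""
    return abs(params["numTrees"] - 40) + abs(params["maxDepth"] - 10)


grid = {"numTrees": [20, 40, 80], "maxDepth": [5, 10, 15]}
best_params, best_score = grid_search(grid, toy_score)
```

A full grid grows multiplicatively with each added parameter, so in practice the grid is kept small or sampled randomly.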

Spark-HPCC Systems – Future & Future Use Cases
• Continued support and improvement
• Leveraging libraries in Spark, Python and R
• Optimus – Data cleaning for Spark
• Matplotlib
• Spark Streaming
• IoT Events
• Telematics
• Deep Learning with Spark
• Possible now through external libraries
• Spark 3.0 will support TensorFlow natively

Closing Thoughts
• Spark-HPCC Systems ecosystem provides new opportunities
• Access to an entire ecosystem of libraries and tools
• Apache Zeppelin is great
• Machine Learning and Deep Learning are accessible
• FastAI MOOC is a great way to learn
• Everyone should learn ML & Deep Learning

Questions?
Spark Plugin:
https://hpccsystems.com/download
Spark-HPCC Systems Connector:
https://github.com/hpcc-systems/Spark-HPCC
Bulldozer Model Notebook:
https://github.com/hpcc-systems/
FastAI:
https://fast.ai

View this presentation on YouTube:
https://www.youtube.com/watch?v=AQF9XP-Hd74&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=4&t=0s
(4:55:00)
