2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
James McMullan
Sr Software Engineer
LexisNexis Risk Solutions
Leveraging the Spark-HPCC Systems Ecosystem
Overview
• Spark-HPCC Plugin & Connector
• Basics of reading from and writing to HPCC Systems
• Brief introduction to Apache Zeppelin
• Create a random forest model in Spark
• Compare to Kaggle competition leaderboard
• Future of Spark-HPCC Systems Ecosystem
• Closing thoughts
Spark-HPCC Systems Connector
Spark-HPCC Systems – Overview
• Spark-HPCC Systems Connector
  • Spark library
  • Allows reading from and writing to HPCC Systems
  • Can be installed on any Spark cluster
• Spark Plugin – Managed Spark Cluster
  • Requires HPCC Systems 7.0+
  • Spark cluster that mirrors Thor cluster
  • Configured through Config Manager
  • Installs Spark-HPCC Systems connector
Spark-HPCC Systems Connector - Progress
• Added support for remote writing
  • HPCC Systems 7.2+
• Improved performance
  • Scala, Python and R
• Increased reliability
  • Lots of testing and bug fixes
• Added support for DataSource API v1
  • Unified Read / Write interface
Spark-HPCC Systems Connector – Reading
PySpark Read Example

    clusterURL = "http://192.168.56.101:8010"
    fileName = "example::dataset"

    # Read dataset from HPCC Systems
    df = spark.read.load(format="hpcc",
                         host=clusterURL,
                         password="",
                         username="",
                         limitPerFilePart=100,
                         projectList="field1, field2",
                         fileAccessTimeout=240,
                         path=fileName)

SparkR Read Example

    clusterURL <- "http://192.168.56.101:8010"
    fileName <- "example::dataset"

    # Read dataset from HPCC Systems
    df <- read.df(source = "hpcc",
                  host = clusterURL,
                  password = "",
                  username = "",
                  limitPerFilePart = 100,
                  projectList = "field1, field2",
                  fileAccessTimeout = 240,
                  path = fileName)
Spark-HPCC Systems Connector – Writing
PySpark Write Example

    clusterURL = "http://192.168.56.101:8010"
    fileName = "example::dataset"

    # Write dataset to HPCC Systems
    df.write.save(format="hpcc",
                  mode="overwrite",
                  host=clusterURL,
                  password="",
                  username="",
                  cluster="mythor",
                  path=fileName)

SparkR Write Example

    clusterURL <- "http://192.168.56.101:8010"
    fileName <- "example::dataset"

    # Write dataset to HPCC Systems
    write.df(df, source = "hpcc",
             host = clusterURL,
             cluster = "mythor",
             path = fileName,
             mode = "overwrite",
             password = "",
             username = "",
             fileAccessTimeout = 240)
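Because reads and writes take the same option names, the shared connection settings can be collected once and reused for both. The following is a minimal sketch, not part of the connector: hpcc_options, read_hpcc and write_hpcc are hypothetical helper names, and the option keys are simply the ones used in the read and write examples. It assumes a PySpark session that has the Spark-HPCC Systems connector available.

```python
def hpcc_options(host, username="", password="", **extra):
    """Collect connector options in one dict so reads and writes share them.

    Hypothetical helper: the keys mirror the options shown on the slides
    (host, username, password, cluster, fileAccessTimeout, ...).
    """
    opts = {"host": host, "username": username, "password": password}
    opts.update(extra)
    return opts


def read_hpcc(spark, path, **opts):
    """Read an HPCC Systems logical file into a Spark DataFrame."""
    return spark.read.format("hpcc").options(**opts).load(path)


def write_hpcc(df, path, mode="overwrite", **opts):
    """Write a Spark DataFrame to an HPCC Systems logical file."""
    df.write.format("hpcc").mode(mode).options(**opts).save(path)


# Example use inside a notebook paragraph where `spark` already exists:
#   common = hpcc_options("http://192.168.56.101:8010")
#   df = read_hpcc(spark, "example::dataset", projectList="field1, field2", **common)
#   write_hpcc(df, "example::copy", cluster="mythor", **common)
```

Keeping the credentials and host in one dictionary means a notebook only needs to change them in a single place when it is pointed at a different cluster.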
Apache Zeppelin
Apache Zeppelin - Overview
• Multi-user Notebook Environment
• Front end for Spark
• Collaborative
• Easy to use
• Handles resource management
• Handles job queuing and resource allocation
• We do not support or package Zeppelin
Apache Zeppelin – Features
• Multi-user environment by default
• Version Control
• Interpreters are bound at a Paragraph level
• Allows multiple languages in a single notebook
• Built-in visualization tools
• Ability to move data between languages
• Credential management
Spark ML Model
Spark ML Model – Brief Intro to Random Forests
• Random Forests: ensemble of decision trees
• Averaging the output of multiple decision trees generally gives a better prediction than a single tree
• Random Forests require the input data to be numeric
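The averaging idea can be illustrated without Spark. The sketch below is a toy in plain Python, not the Spark ML implementation used in the talk: each "tree" is a depth-1 regression stump fitted to a bootstrap sample of a single numeric feature, and the ensemble prediction is the mean of the stump predictions. All function names are made up for illustration. Strictly speaking this is bagging; a full random forest also picks a random subset of features at each split, which has no effect when there is only one feature.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave rows on both sides
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # Every x in the sample is identical: predict the overall mean.
        overall = mean(ys)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Ensemble prediction: average the output of every tree."""
    return mean(tree(x) for tree in trees)
```

Each stump sees a slightly different sample, so individual stumps disagree; averaging smooths out those disagreements, which is the intuition behind the second bullet.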
Spark ML Model – Bulldozers R US
• Open source bulldozer auction dataset from Kaggle
• www.kaggle.com/c/bluebook-for-bulldozers
• Create a Random Forest Model to predict auction price
• Compare our model against the Kaggle leaderboard
• Score is calculated by RMSLE (Root Mean Squared Logarithmic Error)
• RMSLE penalizes relative rather than absolute error, so it behaves like a percentage-based error
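For reference, RMSLE can be computed in a few lines. This is a minimal sketch assuming the common log(1 + x) form of the metric; the function name rmsle and the sample prices are illustrative and not taken from the talk or the competition data.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# A 10% overestimate costs about the same on a cheap or an expensive machine,
# which is why the metric reads like a percentage-based error.
cheap = rmsle([1_000], [1_100])
expensive = rmsle([100_000], [110_000])
print(round(cheap, 4), round(expensive, 4))
```

Because only the ratio between predicted and actual price matters, large auction prices do not dominate the score the way they would with a plain RMSE.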
Spark ML Model – Results
• Our RMSLE ~0.26
  • Around 50th out of 450 participants (roughly the top 11%)
  • Not bad for little to no feature engineering
• RMSLE of ~0.22 is possible with Random Forests
  • Hyperparameter tuning
  • Feature engineering
• Deep Learning can do better than ~0.22
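One rough way to read these scores: since RMSLE is an error on the log scale, exponentiating it gives a typical multiplicative factor between predicted and actual price. This is only a rule of thumb, assuming the log errors are of similar size across rows; typical_ratio is a hypothetical helper name.

```python
import math


def typical_ratio(rmsle_score):
    """Convert a log-scale error into an approximate multiplicative factor."""
    return math.exp(rmsle_score)


for score in (0.26, 0.22):
    factor = typical_ratio(score)
    print(f"RMSLE {score:.2f}: predictions off by a factor of about {factor:.2f}")
```

Under that reading, 0.26 corresponds to predictions that are typically off by a factor of about 1.30, and 0.22 to a factor of about 1.25.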
Spark-HPCC Systems – Future & Future Use Cases
• Continued support and improvement
• Leveraging libraries in Spark, Python and R
  • Optimus – Data cleaning for Spark
  • Matplotlib
• Spark Streaming
  • IoT Events
  • Telematics
• Deep Learning with Spark
  • Possible now through external libraries
  • Spark 3.0 will support TensorFlow natively
Closing Thoughts
• Spark-HPCC Systems ecosystem provides new opportunities
• Access to an entire ecosystem of libraries and tools
• Apache Zeppelin is great
• Machine Learning and Deep Learning are accessible
• FastAI MOOC is a great way to learn
• Everyone should learn ML & Deep Learning
Questions?
Spark Plugin:
https://hpccsystems.com/download
Spark-HPCC Systems Connector:
https://github.com/hpcc-systems/Spark-HPCC
Bulldozer Model Notebook:
https://github.com/hpcc-systems/
FastAI:
https://fast.ai
View this presentation on YouTube (4:55:00):
https://www.youtube.com/watch?v=AQF9XP-Hd74&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=4&t=0s


Editor's Notes

  • #10: We had a problem: we needed a front-end interface for Spark. Using the command line to submit jobs to a cluster is not a good workflow for data science. This is a solved problem; notebook environments such as Jupyter Notebooks and Apache Zeppelin were created to address it. Internally we evaluated both Jupyter Notebooks and Apache Zeppelin and found that Apache Zeppelin met our needs better. We have been testing Apache Zeppelin with Spark since February. We have also contributed some code to mainline Zeppelin to meet our needs. We aren't packaging Zeppelin alongside the Spark-HPCC environment. I am discussing Zeppelin because I will be using it during the demo portion of the talk and wanted to give some background.
  • #15: Show Spark demo