©2015 IBM Corporation
@RomeoKienzler
Cloud scale predictive DevOps
automation using Apache Spark
What you will learn
• What Spark really is and what it means for your use cases
• How to use Spark in the Cloud
• Basic programming in Scala
• Basic programming in Python
• Some functional programming
• Some insights into Spark Streaming, MLlib, GraphX, Spark SQL (Shark)
• Solve any data analytics problem of any size
Introductions
Excursion, Demo: What is the IBM Cloud about?
My Peers in US
What is our motivation?
• Local or cloud development and deployment
  ‣ Advantages of local development
    • Rapid development
    • Productivity
    • Excellent for proof of concept
    • Easy debugging
  ‣ Disadvantages of local development
    • Time-consuming to reproduce at a larger scale
    • Difficult to share quickly
    • Heavy on hardware resources
    • Demands skills in deployment and operations
What is Spark?
Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes.
Spark Core Libraries
Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
Fast
• In-memory storage greatly reduces disk I/O
• Up to 100x faster in memory, 10x faster on disk (compared with Hadoop MapReduce)
Open Source
• Largest project and one of the most active on Apache
• Vibrant, growing community of developers continuously improves the code base and extends its capabilities
• Fast adoption in the enterprise (IBM, Databricks, etc.)
Distributed data processing
• Fault tolerant: seamlessly recomputes data lost to hardware failure
• Scalable: easily increase the number of worker nodes
• Flexible job execution: batch, streaming, interactive
Web Scale
• Easily handles petabytes of data without special code handling
• Compatible with the existing Hadoop ecosystem
Productive
• Unified programming model across a range of use cases
• Rich and expressive APIs hide the complexities of parallel computing and worker node management
• Support for Java, Scala, Python and R: less code written
• Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
Ecosystem of IBM Analytics for Apache Spark as a service
A word about the Scala programming language
‣ Scala is object-oriented but also supports a functional programming style
‣ Bi-directional interoperability with Java
‣ Resources:
  • Official web site: http://scala-lang.org
  • Excellent first steps site: http://www.artima.com/scalazine/articles/steps.html
  • Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
Spark Streaming
‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming)
‣ Breaks the streaming data down into smaller pieces (micro-batches), which are then sent to the Spark engine
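The idea of breaking a live stream into small pieces can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the function name `micro_batches`, the batch size and the sample events are hypothetical, and Spark Streaming itself cuts batches by time interval rather than by element count.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly endless) stream into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # the stream is exhausted
            return
        yield batch


# Hypothetical event stream; each batch would be handed to the engine as one small job.
events = ["GET /", "GET /a", "POST /b", "GET /", "GET /c"]
batch_sizes = [len(batch) for batch in micro_batches(events, batch_size=2)]
print(batch_sizes)
```

Because the function is a generator, it never needs the whole stream in memory, which is the property that makes the approach usable on live data.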
Spark Streaming
‣ Provides connectors for multiple data sources:
  - Kafka
  - Flume
  - Twitter
  - MQTT
  - ZeroMQ
‣ Provides an API to create custom connectors. Many examples are available on GitHub and spark-packages.org
Introduction to Notebooks
‣ Notebooks allow the creation of interactive executable documents that combine rich text (Markdown), executable code (Scala, Python or R) and graphics (matplotlib)
‣ First idea: Mathematica in the 80s
‣ Apache Spark provides APIs in several languages that can be executed in a REPL shell: Scala, Python (PySpark), R
‣ Multiple open-source implementations available:
  - Jupyter: https://jupyter.org
  - Apache Zeppelin: http://zeppelin-project.org
GraphX
GraphX
[0,0.38321138272637756,[[532,0.6149796534336811],[664,0.8356153428569336],[9,0.1570050826694932]]]
[1,0.18065772749938025,[[575,0.17536476465887452],[411,0.27954200550966013],[649,0.8039858806410443],
[915,0.4486520294403563],[726,0.27371661315845497],[284,0.3189228134847226],[371,0.6743424877728893],
[105,0.02948311591149355]]]
[2,0.8326535898442957,[[187,0.237892453843756],[433,0.4888193209543986]]]
[3,0.8486227788712039,[[10,0.42657104117967704],[911,0.5044620825940729],[471,0.7925728999064424],[144,0.2682384916510707]]]
[4,0.213144518747322,[[287,0.5153627230542949],[500,0.9610167165689496],[471,0.7384315544250067]]]
[5,0.13936158086656125,[[788,0.6207349427530987],[716,0.8224267617783542],[29,0.9599548358124281],[446,0.6890358757389514],
[81,0.6200710121203236]]]
[6,0.18348506014555566,[[312,0.3572072639232693]]]
[7,0.4944948151337266,[[337,0.17081573705381814],[749,0.5357649236615107],[908,0.16851141164430072],
[94,0.46547674836585895],[327,0.8010320866648896]]]
[8,0.8065548204216567,[[706,0.7232142181639899],[981,0.9877867134305364],[581,0.4675382627711474]]]
[9,0.721217368691803,[]]
[10,0.9039814039370966,[[983,0.4159992760397089],[163,0.850921982262316],[50,0.22098242172416915],[483,0.8338046999885983],
[118,0.6589390317899275]]]
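The records above are valid JSON arrays, so a standard JSON parser can read them. The sketch below assumes (the slides do not say so) that each record has the form [vertex id, vertex value, [[target vertex id, edge weight], ...]]. The function name `parse_vertex` is hypothetical; the three sample lines are copied from the listing.

```python
import json
from typing import Dict, List, Tuple


def parse_vertex(line: str) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Parse one record; the meaning of the fields is an assumption."""
    vertex_id, value, edges = json.loads(line)
    return vertex_id, value, [(target, weight) for target, weight in edges]


# Three records copied from the listing above.
lines = [
    "[2,0.8326535898442957,[[187,0.237892453843756],[433,0.4888193209543986]]]",
    "[6,0.18348506014555566,[[312,0.3572072639232693]]]",
    "[9,0.721217368691803,[]]",
]

# Adjacency list: vertex id -> list of (target id, edge weight).
graph: Dict[int, List[Tuple[int, float]]] = {}
for line in lines:
    vertex_id, _value, edges = parse_vertex(line)
    graph[vertex_id] = edges

# Number of outgoing edges per vertex.
out_degree = {vertex_id: len(edges) for vertex_id, edges in graph.items()}
print(out_degree)
```

An adjacency list like this is a convenient intermediate form: it can be flattened into separate vertex and edge collections, which is the shape a graph library typically expects.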
GraphX
Lab 1: Notebook walkthrough
‣ https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
‣ http://bit.ly/ibmvelocity1
‣ Sign up on Bluemix: http://ibm.biz/joinIBMCloud
‣ Create an Apache Starter boilerplate application
‣ Create notebooks in Python, Scala, or both
‣ Run basic commands and get familiar with
Break
Use-cases
• Customer Behavior Analytics (Retail & Merchandising)
  - Refine strategy based on customer behaviour data
  - …
• Churn Reduction (Telco, Cable, Schools)
  - Predict customer drop-offs/drop-outs
• Cyber Security (IT – Any Industry)
  - Network intrusion detection
  - Fraud detection
  - …
• Predictive Maintenance (IoT) (IT – Any Industry)
  - Predict system failure before it happens
• Network Performance Optimization (IT – Any Industry)
  - Diagnose real-time device issues
  - …
‣ SETI use case for astronomers, data scientists, mathematicians and algorithm designers.
IBM Spark @ SETI - Application Architecture
• Spark@SETI GitHub repository
  - Python code modules for data access and analytics
  - Jupyter notebooks
  - Documentation and links to other relevant GitHub repos
  - Standard GitHub collaboration functions
• Import of signal data from SETI radio telescope data archives (~10 years)
• Shared repository of SETI data in Object Store
  - 200M rows of signal event data
  - 15M binary recordings of “signals of interest”
• Collaborative environment for project team data scientists (NASA, SETI Institute, Penn State, IBM Research)
• Actively analyzing over 4TB of signal data. Results have already been used by SETI to re-program the radio telescope observation sequence to include “new targets of interest”
Lab 2: Twitter Sentiment Analytics
‣ https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/
‣ http://bit.ly/ibmvelocity2
Demo 1: MLlib
Challenge: Calculate and plot the Apache HTTPD response code distribution as a bar chart
‣ Download the access_log file from https://github.com/romeokienzler/developerWorks
‣ http://bit.ly/ibmvelocity3
‣ Upload the file to the SWIFT Object Store (Hint: have a look at Tutorial 1 - Load Data.ipynb)
‣ Use what you have learned so far to do it yourself, in either Scala or Python
‣ I’ll walk around and help you (Hint: search for the WordCount example in Spark)
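One possible starting point for the challenge, written in plain Python rather than Spark so that it runs anywhere. It assumes the log uses the Apache Common Log Format, where the status code follows the quoted request. The regular expression, the function names and the sample lines are illustrative assumptions and are not taken from the workshop's access_log file. In Spark the same counting follows the WordCount pattern: map each line to a (code, 1) pair, then sum the values by key.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Status code: three digits directly after the quoted request string.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def status_code(line: str) -> Optional[str]:
    """Return the HTTP status code of one log line, or None if it does not match."""
    match = STATUS_RE.search(line)
    return match.group(1) if match else None


def code_distribution(lines: Iterable[str]) -> Counter:
    """Count how often each status code occurs (the WordCount pattern)."""
    codes = (status_code(line) for line in lines)
    return Counter(code for code in codes if code is not None)


# Made-up sample lines in Common Log Format.
sample = [
    '127.0.0.1 - - [10/Oct/2015:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2015:13:55:37 +0200] "GET /missing HTTP/1.1" 404 209',
    '127.0.0.1 - - [10/Oct/2015:13:55:38 +0200] "GET /index.html HTTP/1.1" 200 2326',
]
distribution = code_distribution(sample)

# Text bar chart, one row per status code.
for code, count in sorted(distribution.items()):
    print(f"{code} {'#' * count}")
```

In a notebook, the resulting counts could be handed to a plotting library such as matplotlib instead of being printed as text.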
Thank You

Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amsterdam - O'Reilly Conferences, 28 - 30, October 2015

Editor's Notes

  • #21:
    Churn Reduction – Telcos to public schools. Dig into why there are drop-offs/drop-outs. Example outcomes include: discoveries that link every 1% reduction of churn to an actual revenue number; root cause analysis showing that the under-utilization of services was a key leading indicator of churn; a ranking of all users by propensity for immediate churn, used to optimize a proactive customer concierge service.
    Cybersecurity – Network intrusion prevention. A crawl/walk/run strategy that managed security threats by starting with monitoring, progressed to prevention, and will eventually move into an offensive operations mode. A similar scenario would be Network Performance Modeling, e.g. improving your capability to diagnose real-time device problems, which led to a dramatically improved support experience for the end user.
    Other IT Operations – Customer Behavior Analysis: significantly matured product development strategy by reframing segmentation based on user behavior; prioritized investment based on what people are doing. Companies like Etsy are doing this with their website. Online retailer: personalized shopper experience plus next logical product.
    Predictive Maintenance – Airline company and elevator company example.