Vitalii Bondarenko
Data Platform Competency Manager at Eleks
Vitaliy.bondarenko@eleks.com
HDInsight: Spark
Advanced in-memory BigData Analytics with Microsoft Azure
Agenda
● Spark Platform
● Spark Core
● Spark Extensions
● Using HDInsight Spark
About me
Vitalii Bondarenko
Data Platform Competency Manager
Eleks
www.eleks.com
20 years in software development
9+ years of developing for MS SQL Server
3+ years of architecting Big Data Solutions
● DW/BI Architect and Technical Lead
● OLTP DB Performance Tuning
● Big Data Data Platform Architect
Spark Platform
Spark Stack
● Clustered computing platform
● Designed to be fast and general purpose
● Integrated with distributed systems
● APIs for Python, Scala and Java; clear and understandable code
● Integrated with Big Data and BI Tools
● Integrated with different databases, systems and libraries such as Cassandra, Kafka, H2O
● First Apache release in 2013; v2.0 was released this month
Map-reduce computations
In-memory map-reduce
Execution Model
Spark Execution
● Shells and standalone applications
● Local and Cluster (Standalone, Yarn, Mesos, Cloud)
Spark Cluster Architecture
● Master / Cluster manager
● Cluster manager allocates resources on nodes
● Master sends app code and tasks to nodes
● Executors run tasks and cache data
Connect to Cluster
● Local
● SparkContext and Master field
● spark://host:7077
● Spark-submit
DEMO: Execution Environments
● Local Spark installation
● Shells and Notebook
● Spark Examples
● HDInsight Spark Cluster
● SSH connection to Spark in Azure
● Jupyter Notebook connected to HDInsight Spark
Spark Core
RDD: resilient distributed dataset
● Fault-tolerant parallelized collections (or Hadoop datasets)
● Transformations define new RDDs (filter, map, distinct, union, subtract, etc.)
● Actions run the computation and return a result (count, collect, first)
● Transformations are lazy
● Actions trigger the computation of pending transformations
● Broadcast variables send read-only data to executors
● Accumulators collect data on the driver
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
# count() returns an int, so convert it before concatenating
print("Input had " + str(badLinesRDD.count()) + " concerning lines")
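The lazy-evaluation rule above can be modelled without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `LazyDataset`, its methods, and the sample log lines are hypothetical names chosen for illustration. It only shows that transformations build a recipe and that the source is touched when an action runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a recipe, computes only on an action."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @classmethod
    def from_list(cls, items, log):
        def compute():
            log.append("read source")  # side effect shows when work really happens
            return iter(items)
        return cls(compute)

    def filter(self, predicate):
        # Transformation: returns a new dataset, runs nothing yet
        return LazyDataset(lambda: (x for x in self._compute() if predicate(x)))

    def union(self, other):
        # Transformation: chains both recipes, still lazy
        def compute():
            yield from self._compute()
            yield from other._compute()
        return LazyDataset(compute)

    def count(self):
        # Action: forces the whole chain to execute
        return sum(1 for _ in self._compute())


log = []
lines = ["error: disk", "ok", "warning: cpu", "error: net"]
input_ds = LazyDataset.from_list(lines, log)
errors = input_ds.filter(lambda x: "error" in x)
warnings = input_ds.filter(lambda x: "warning" in x)
bad_lines = errors.union(warnings)

reads_before_action = len(log)   # no read has happened yet
bad_count = bad_lines.count()    # the action triggers evaluation
reads_after_action = len(log)    # the source was read once per branch
print("Input had " + str(bad_count) + " concerning lines")
```

In this toy model the source is read twice, once for each filtered branch, which is the recomputation that persistence is meant to avoid.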
Spark program scenario
● Create RDD (loading external datasets, parallelizing a collection on driver)
● Transform
● Persist intermediate RDDs as results
● Launch actions
Persistence (Caching)
● Avoid recalculations
● 10x faster in-memory
● Fault-tolerant
● Persistence levels
● Persist before first action
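from pyspark import StorageLevel
nums = sc.parallelize(range(1000))
result = nums.map(lambda x: x ** x)
# Persist before the first action so that later actions reuse the cached data
result.persist(StorageLevel.MEMORY_ONLY)
result.count()
result.collect()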
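Why persisting before the first action matters can be shown with a small counter. This is a pure-Python sketch under stated assumptions: `ToyRDD` and its `calls` counter are hypothetical and only imitate the idea of marking a dataset for caching, not Spark's storage levels.

```python
class ToyRDD:
    """Toy model of a mapped dataset with optional caching (not Spark's implementation)."""

    def __init__(self, source, fn):
        self.source = source      # input elements
        self.fn = fn              # function applied lazily to each element
        self.calls = 0            # how many times fn has been evaluated
        self._persist = False
        self._cache = None

    def persist(self):
        self._persist = True      # only marks the dataset; nothing is computed here
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache    # reuse the result of an earlier action
        data = []
        for x in self.source:
            self.calls += 1
            data.append(self.fn(x))
        if self._persist:
            self._cache = data
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


uncached = ToyRDD(range(1000), lambda x: x * x)
uncached.count()
uncached.collect()

cached = ToyRDD(range(1000), lambda x: x * x).persist()
cached.count()
cached.collect()

print(uncached.calls, cached.calls)
```

Two actions on the unpersisted dataset evaluate the function for every element twice; with `persist()` the second action reads the cached list.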
Transformations (1)
Transformations (2)
Actions (1)
Actions (2)
Data Partitioning
● userData.join(events)
● userData.partitionBy(100).persist()
● 3-4 partitions per CPU core
● userData.join(events).mapValues(...).reduceByKey(...)
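The benefit of partitioning both sides of a join the same way can be sketched in plain Python. Everything here is an assumption made for illustration: the `crc32`-based partitioner, the helper names, and the sample data are hypothetical, and Spark's own hash partitioner works differently in detail. The point is only that equal keys land in the same partition index, so each partition can be joined locally.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Deterministic hash partitioner (crc32 chosen for this sketch only)
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    # Group (key, value) pairs by the partition their key hashes to
    parts = defaultdict(list)
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_parts, right_parts):
    # Join partition i of the left side with partition i of the right side only
    joined = []
    for idx, left in left_parts.items():
        right_by_key = defaultdict(list)
        for k, v in right_parts.get(idx, []):
            right_by_key[k].append(v)
        for k, lv in left:
            for rv in right_by_key[k]:
                joined.append((k, (lv, rv)))
    return joined


def suggested_partitions(total_cores, per_core=3):
    # Rule of thumb from the slide: 3-4 partitions per CPU core
    return total_cores * per_core


user_data = [("u1", "Kyiv"), ("u2", "Lviv"), ("u3", "Odesa")]
events = [("u1", "click"), ("u3", "view"), ("u1", "buy")]
n = suggested_partitions(total_cores=2)
joined = sorted(local_join(partition_by(user_data, n), partition_by(events, n)))
print(joined)
```

In Spark the analogous step is partitioning and persisting the large, reused dataset once, so that later joins against it move only the smaller side across the network.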
DEMO: Spark Core Operations
● Transformations
● Actions
Spark Extensions
Spark Streaming Architecture
● Micro-batch architecture
● StreamingContext
● Batch interval from 500ms
● Transformations run on the Spark engine
● Output operations instead of Actions
● Different sources and outputs
Spark Streaming Example
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
input_stream = ssc.textFileStream("sampleTextDir")
word_pairs = input_stream.flatMap(
    lambda l: l.split(" ")).map(lambda w: (w, 1))
counts = word_pairs.reduceByKey(lambda x, y: x + y)
counts.pprint()  # PySpark DStreams use pprint(); print() is the Scala/Java name
ssc.start()
ssc.awaitTermination()
● Process RDDs in batches
● Start after ssc.start()
● Output to console on Driver
● Awaiting termination
Streaming on a Cluster
● Receivers with replication
● SparkContext on Driver
● Output from Executors in batches: saveAsHadoopFiles()
● spark-submit for creating and scheduling periodic streaming jobs
● Checkpointing to save state and restore from that point: ssc.checkpoint("hdfs://...")
Streaming Transformations
● DStreams
● Stateless transformations
● Stateful transformations
● Windowed transformations
● updateStateByKey
● reduceByWindow, reduceByKeyAndWindow
● Recommended batch size from 10 sec
val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
{(x, y) => x + y}, // Adding elements in the new batches entering the window
{(x, y) => x - y}, // Removing elements from the oldest batches exiting the window
Seconds(30), // Window duration
Seconds(10)) // Slide duration
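The add/remove pair of functions in the windowed reduce above lets each new window be derived from the previous one instead of being recomputed. Below is a minimal pure-Python sketch of that idea, assuming the window length and slide are given in whole batches and the slide is no longer than the window; the function names and sample IP addresses are hypothetical.

```python
from collections import Counter


def window_counts_incremental(batches, window_len, slide):
    """Per-key counts for each window, updating the previous window's totals.

    batches: list of lists of keys, one inner list per batch interval.
    window_len, slide: in batches (3 and 1 model 30 s / 10 s with 10 s batches).
    """
    current = Counter()
    results = []
    for end in range(slide, len(batches) + 1, slide):
        # Add the batches that entered the window since the last slide
        for batch in batches[end - slide:end]:
            current.update(batch)        # plays the role of (x, y) => x + y
        # Subtract the batches that fell out of the window
        old_start = max(end - slide - window_len, 0)
        old_end = max(end - window_len, 0)
        for batch in batches[old_start:old_end]:
            current.subtract(batch)      # plays the role of (x, y) => x - y
        results.append({k: v for k, v in current.items() if v > 0})
    return results


def window_counts_recomputed(batches, window_len, slide):
    """Reference version: recount every batch inside each window from scratch."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        window = batches[max(end - window_len, 0):end]
        results.append(dict(Counter(k for batch in window for k in batch)))
    return results


ip_batches = [["10.0.0.1", "10.0.0.2"], ["10.0.0.1"], ["10.0.0.3"],
              ["10.0.0.1", "10.0.0.3"], ["10.0.0.2"]]
incremental = window_counts_incremental(ip_batches, window_len=3, slide=1)
recomputed = window_counts_recomputed(ip_batches, window_len=3, slide=1)
print(incremental[-1])
```

Both versions produce the same counts; the incremental one touches only the batches entering and leaving the window, which is why the inverse function is worth supplying for long windows.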
DEMO: Spark Streaming
● Simple streaming with PySpark
Spark SQL
● Spark SQL: an interface for working with structured data using SQL
● Works with Hive tables and HiveQL
● Works with files (JSON, Parquet, etc.) with a defined schema
● JDBC/ODBC connectors for BI tools
● Integrated with Hive and Hive types, supports Hive UDFs
● DataFrame abstraction
Spark DataFrames
● hiveCtx.cacheTable("tableName"), in-memory, column-store, while driver is alive
● df.show()
● df.select(df("name"), df("age") + 1)
● df.filter(df("age") > 19)
● df.groupBy(df("name")).min()
# Import Spark SQL
from pyspark import SparkContext
from pyspark.sql import HiveContext, Row
# Or if you can't include the hive requirements
from pyspark.sql import SQLContext, Row
sc = SparkContext(...)
hiveCtx = HiveContext(sc)
sqlContext = SQLContext(sc)
# jsonFile() is the Spark 1.x call; from 1.4 on, hiveCtx.read.json(inputFile) is preferred
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweetCount
topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
Catalyst: Query Optimizer
● Analysis: resolves tables, columns and functions, and creates a logical plan
● Logical Optimization: applies rules to optimize the plan
● Physical Planning: chooses physical operators to execute the logical plan
● Cost estimation
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
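For the query above, a rule-based logical optimization could push the filter beneath the inner projection and then collapse the two projections. The sketch below is a toy model in pure Python, not Catalyst's code: the tuple-based plan format, the two rules, and the sample `People` rows are assumptions made for illustration.

```python
def optimize(plan):
    """Apply two rewrite rules in one bottom-up pass over a tuple-based plan."""
    op = plan[0]
    if op == "scan":
        return plan
    if op == "filter":
        _, cond, child = plan
        child = optimize(child)
        column = cond[0]
        # Rule 1: push a filter beneath a projection that keeps its column
        if child[0] == "project" and column in child[1]:
            _, cols, grandchild = child
            return ("project", cols, optimize(("filter", cond, grandchild)))
        return ("filter", cond, child)
    if op == "project":
        _, cols, child = plan
        child = optimize(child)
        # Rule 2: collapse nested projections when the outer one is a subset
        if child[0] == "project" and set(cols) <= set(child[1]):
            return ("project", cols, child[2])
        return ("project", cols, child)
    raise ValueError("unknown operator: " + str(op))


def execute(plan, tables):
    """Naive interpreter, used only to check both plans return the same rows."""
    op = plan[0]
    if op == "scan":
        return [dict(row) for row in tables[plan[1]]]
    if op == "filter":
        column, value = plan[1]
        return [row for row in execute(plan[2], tables) if row[column] == value]
    if op == "project":
        return [{c: row[c] for c in plan[1]} for row in execute(plan[2], tables)]
    raise ValueError("unknown operator: " + str(op))


# SELECT name FROM (SELECT id, name FROM People) p WHERE p.id = 1
original = ("project", ("name",),
            ("filter", ("id", 1),
             ("project", ("id", "name"),
              ("scan", "People"))))
optimized = optimize(original)

tables = {"People": [{"id": 1, "name": "Ann", "age": 30},
                     {"id": 2, "name": "Bob", "age": 25}]}
print(optimized)
print(execute(optimized, tables))
```

The optimized plan filters rows directly after the scan and projects once, while returning the same result as the original plan.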
DEMO: Using SparkSQL
● Simple SparkSQL querying
● Data Frames
● Data exploration with SparkSQL
● Connect from BI
Spark ML
Spark ML
● Classification
● Regression
● Clustering
● Recommendation
● Feature transformation, selection
● Statistics
● Linear algebra
● Data mining tools
Pipeline Components
● DataFrame
● Transformer
● Estimator
● Pipeline
● Parameter
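How the pipeline components listed above fit together can be sketched without Spark. This is a toy model under stated assumptions: lists of dicts stand in for a DataFrame, and `Tokenizer`, `KeywordEstimator`, `KeywordModel`, `Pipeline` and `PipelineModel` are hypothetical classes that only mimic the contract (a Transformer has `transform`, an Estimator has `fit` and returns a Transformer).

```python
class Tokenizer:
    """Transformer: adds a column derived from an existing one."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordModel:
    """Fitted model, itself a Transformer that adds a prediction column."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [dict(r, prediction=int(any(w in self.keywords for w in r["words"])))
                for r in rows]


class KeywordEstimator:
    """Estimator: fit() learns from labelled rows and returns a model."""

    def __init__(self, min_count=1):
        self.min_count = min_count  # a Parameter of the estimator

    def fit(self, rows):
        counts = {}
        for r in rows:
            if r["label"] == 1:
                for w in r["words"]:
                    counts[w] = counts.get(w, 0) + 1
        negative = {w for r in rows if r["label"] == 0 for w in r["words"]}
        return KeywordModel({w for w, c in counts.items()
                             if c >= self.min_count and w not in negative})


class PipelineModel:
    """Result of Pipeline.fit(): a chain of Transformers."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order, fitting Estimators into Transformers on the way."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "Spark is fast", "label": 1},
            {"text": "Hadoop is slow", "label": 0}]
model = Pipeline([Tokenizer(), KeywordEstimator(min_count=1)]).fit(training)
predictions = model.transform([{"text": "Spark rocks"}, {"text": "slow job"}])
print([p["prediction"] for p in predictions])
```

The fitted `PipelineModel` replays the same stages on new data, which is the reason feature preparation and the model are kept in one pipeline.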
Logistic Regression
DEMO: Spark ML
● Training a model
● Data visualization
New in Spark 2.0
● Unifying DataFrames and Datasets in Scala/Java (compile-time syntax and analysis errors). Same performance and convertible.
● SparkSession: a new entry point that supersedes SQLContext and HiveContext.
● Machine learning pipeline persistence
● Distributed algorithms in R
● Faster Optimizer
● Structured Streaming
New in Spark 2.0
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# In PySpark, builder is an attribute, and chained calls need parentheses
spark = (SparkSession
    .builder
    .appName("StructuredNetworkWordCount")
    .getOrCreate())
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = (spark
    .readStream
    .format('socket')
    .option('host', 'localhost')
    .option('port', 9999)
    .load())
# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, ' ')
    ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = (wordCounts
    .writeStream
    .outputMode('complete')
    .format('console')
    .start())
query.awaitTermination()
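# Windowed counts: 10-minute windows sliding every 5 minutes
# (assumes `words` also carries an event-time `timestamp` column)
from pyspark.sql.functions import window
windowedCounts = words.groupBy(
    window(words.timestamp, '10 minutes', '5 minutes'),
    words.word
).count()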
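With a 10-minute window sliding every 5 minutes, each event falls into two overlapping windows. The following pure-Python sketch illustrates that assignment; it assumes timestamps are plain minute offsets and that windows start at multiples of the slide, and `windows_for` and the sample events are hypothetical.

```python
from collections import Counter


def windows_for(ts_minutes, window_len=10, slide=5):
    """Return the [start, end) windows, in minutes, that contain ts_minutes."""
    start = (ts_minutes // slide) * slide  # latest window starting at or before ts
    starts = []
    while start > ts_minutes - window_len:
        starts.append(start)
        start -= slide
    return [(s, s + window_len) for s in sorted(starts)]


# (event time in minutes, word)
events = [(1, "spark"), (7, "spark"), (12, "azure"), (12, "spark")]

windowed_counts = Counter()
for ts, word in events:
    for w in windows_for(ts):
        windowed_counts[(w, word)] += 1  # group by (window, word) and count

for (w, word), n in sorted(windowed_counts.items()):
    print(w, word, n)
```

Grouping by the pair of window and word is what the `groupBy(window(...), words.word)` call expresses; the engine keeps these counts up to date as new events arrive.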
HDInsight: Spark
Spark in Azure
HDInsight benefits
● Ease of creating clusters (Azure portal, PowerShell, .Net SDK)
● Ease of use (notebooks, Azure control panels)
● REST APIs (Livy: job server)
● Support for Azure Data Lake Store (adl://)
● Integration with Azure services (EventHub, Kafka)
● Support for R Server (HDInsight R over Spark)
● Integration with IntelliJ IDEA (Plugin, create and submit apps)
● Concurrent Queries (many users and connections)
● Caching on SSDs (SSD as persist method)
● Integration with BI Tools (connectors for PowerBI and Tableau)
● Pre-loaded Anaconda libraries (200 libraries for ML)
● Scalability (change number of nodes and start/stop cluster)
● 24/7 Support (99.9% uptime SLA)
HDInsight Spark Scenarios
1. Streaming data, IoT and real-time analytics
2. Visual data exploration and interactive analysis (HDFS)
3. Spark with NoSQL (HBase and Azure DocumentDB)
4. Spark with Data Lake
5. Spark with SQL Data Warehouse
6. Machine Learning using R Server, MLlib
7. Putting it all together in a notebook experience
8. Using Excel with Spark
Q&A

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big Data analytics with Microsoft Azure"
