Lighting up Big Data Analytics with Apache Spark in Azure
Jen Stirrup, Data Whisperer, Data Relish UK
Please silence
cell phones
Free online webinar
events
Free 1-day local
training events
Local user groups
around the world
Online special
interest user groups
Business analytics
training
Free Online Resources
PASS Blog
White Papers
Session Recordings
Newsletter www.pass.org
Explore everything PASS has to offer
PASS Connector
BA Insights
Get involved
Session evaluations
Your feedback is important and valuable. 3 Ways to Access:
• Download the GuideBook App and search: PASS Summit 2017
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
• Go to passSummit.com
Submit by 5pm Friday, November 10th to win prizes.
Jen Stirrup
Data Whisperer
Data Relish UK
Postgrad in Artificial Intelligence
Universities in the UK and Paris
AI and BI Consultant for 20 years
Global delivery of projects
Author
Published author on Business Intelligence
technology books
/jenstirrup @jenstirrup jenstirrup
Artificial
Intelligence of
Business
Intelligence
Augmented Reality
Apache Spark™ is a fast and general engine for large-scale data processing.
Apache Spark
It is one of the largest open source projects in data processing.
Since its release, Apache Spark has seen rapid adoption
by enterprises across a wide range of industries.
Apache Spark is a fast, in-memory data processing
engine with elegant and expressive development APIs
that allow data workers to efficiently execute streaming,
machine learning, or SQL workloads that
require fast iterative access to datasets.
Why Apache Spark?
FASTER THAN HADOOP • RUNS EVERYWHERE
Who uses Spark?
Apache Spark
Apache Spark consists of Spark Core and a set of libraries.
The core is the distributed execution engine and the Java,
Scala, and Python APIs offer a platform for distributed ETL
application development.
Quickly achieve success by writing applications in Java,
Scala, or Python.
Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the fundamental
object used in Apache Spark.
RDDs are immutable collections representing datasets.
New RDDs are created upon any operation.
Lineage is also stored.
Input File → Read → RDD → Map → RDD → Filter → RDD → Reduce → Output File
Apache Spark
It comes with a built-in set of over 80 high-level operators,
and you can use it interactively to query data within the
shell.
In addition to Map and Reduce operations, it supports SQL
queries, streaming data, machine learning and graph data
processing.
Apache Spark
Developers can use these capabilities stand-alone or
combine them to run in a single data pipeline use case.
Spark Components on HDInsight
Apache Spark is an open-source parallel processing
framework that supports in-memory processing to boost
the performance of big-data analytic applications.
A Spark cluster on HDInsight is compatible with Azure
Storage (WASB) as well as Azure Data Lake Store.
Spark Components on HDInsight
Apache Spark
When you create a Spark cluster on HDInsight, you create
Azure compute resources with Spark installed and
configured.
It only takes about 10 minutes to create a Spark cluster in
HDInsight. The data to be processed is stored in Azure
Storage or Azure Data Lake Store.
Apache Spark
Spark provides primitives for in-memory cluster
computing.
A Spark job can load and cache data into memory and
query it repeatedly, much more quickly than disk-based
systems.
Spark also integrates into the Scala programming
language to let you manipulate distributed data sets like
local collections.
What does Spark give you?
Apache Spark is a powerful open source processing engine
for Hadoop data, built around speed, ease of use, and
sophisticated analytics.
When it comes to Big Data, processing speed always matters.
We always look to process our huge datasets as fast as
possible.
What does Spark give you?
Spark enables applications in Hadoop clusters to run up to
100x faster in memory, and 10x faster even when running
on disk.
Spark makes this possible by reducing the number of
reads and writes to disk. It stores intermediate processing
data in memory.
Why Spark?
Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming
lets you rapidly develop streaming applications
Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark
Streaming recovers lost work and delivers exactly-once semantics out
of the box with no extra code or configuration
Integrated: Reuse the same code for batch and stream processing,
even joining streaming data to historical data
Why Spark?
It uses the concept of the Resilient Distributed Dataset (RDD),
which allows it to transparently store data in memory and
persist it to disk only when needed.
This helps to eliminate most of the disk reads and writes,
the main time-consuming factors in data processing.
YARN Data Operating System:
YARN is one of the key features in the second-generation
Hadoop 2 version of the Apache Software Foundation's
open source distributed processing framework.
Originally described by Apache as a redesigned resource
manager, YARN is now characterized as a large-scale,
distributed operating system for big data applications.
YARN Data Operating System:
YARN is a software rewrite that decouples MapReduce's
resource management and scheduling capabilities from the
data processing component, enabling Hadoop to support
more varied processing approaches and a broader array of
applications.
Spark Deployment Modes:
Two deployment modes can be used to launch Spark applications:
In cluster mode, jobs are managed by the YARN cluster. The Spark driver runs
inside an Application Master (AM) process that is managed by YARN. This
means that the client can go away after initiating the application.
In client mode, the Spark driver runs in the client process, and the Application
Master is used only to request resources from YARN.
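The two modes differ only in one `spark-submit` flag. The paths, class name, and jar below are placeholders for illustration; `--deploy-mode` is the setting that matters.

```shell
# Cluster mode: the driver runs inside the YARN Application Master,
# so the submitting client can disconnect after launch.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp my-app.jar

# Client mode: the driver runs in the local client process;
# the Application Master only requests resources from YARN.
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp my-app.jar
```

Client mode suits interactive work (shells, notebooks); cluster mode suits production jobs that must outlive the submitting session.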
Resilient Distributed Datasets
The Resilient Distributed Dataset (RDD) is the fundamental
data structure of Spark. It is an immutable distributed
collection of objects.
Each dataset in an RDD is divided into logical partitions, which
may be computed on different nodes of the cluster. RDDs
can contain any type of Python, Java, or Scala objects,
including user-defined classes.
Resilient Distributed Datasets
There are two ways to create RDDs:
Parallelizing an existing collection in your driver program, or
referencing a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.
Resilient Distributed Datasets
Parallelized Collections
Parallelized collections are created by calling SparkContext’s
parallelize method on an existing collection in your driver
program (a Scala Seq). The elements of the collection are
copied to form a distributed dataset that can be operated
on in parallel.
Resilient Distributed Datasets
External Datasets
Spark can create distributed datasets from any storage
source supported by Hadoop, including your local file
system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop
InputFormat.
Transformations
map(func): Returns a new distributed dataset formed by
passing each element of the source through a function
func.
filter(func): Returns a new dataset formed by selecting
those elements of the source on which func returns true.
distinct([numTasks]): Returns a new dataset that contains
the distinct elements of the source dataset.
Summary
Try it out!
Thank You
Learn more from Jen Stirrup
/jenstirrup @jenstirrup jenstirrup

Editor's Notes

  • #3: Today, CIOs and other business decision-makers are increasingly recognizing the value of open source software and Azure cloud computing for the enterprise, as a way of driving down costs whilst delivering enterprise capabilities. For the Business Intelligence professional, how can you introduce Open Source for analytics into the Enterprise in a robust way, whilst also creating an architecture that accommodates cloud, on-premise and hybrid architectures? We will examine strategies for using open source technologies to improve common Business Intelligence issues, using Apache Spark as our backdrop to delivering open source Big Data analytics. - incorporating Apache Spark into your existing projects - looking at your choices to parallelize your computations across nodes of a Hadoop cluster with Apache Spark - how ScaleR works with Spark - Using sparklyr and SparkR within a ScaleR workflow Join this session to learn more about open source with Azure for Business Intelligence
  • #7: Image credit: https://pixabay.com/en/users/Seanbatty-5097598/ No attribution required. In this information age, we drive information to create value (Skok, 2013). But, the tools which create this value have always required substantial economic capital. Intelligence systems that learn and suggest what we need to know, based on: History Your colleague’s actions Data behaviour AI can make sense of data Learn and predict what you need to see.
  • #8: https://pixabay.com/en/directory-away-wisdom-education-229117/ We need something to prioritize the data for us Insights come in the form of KPIs but they are automatic, suggestive, predictive and drive value. Gone are the days of reports and complicated dashboards. We will see more focused, targeted information that can be consumed by users. Mobile, apple watch and we can react immediately.
  • #9: https://pixabay.com/en/laptop-prezi-3d-presentation-mockup-2411303/ Augmented reality. If we think we are in trouble over the three Vs…. Too much data. Automated data integration Blending data is essential to insights Automated data
  • #10: Blockchain – people are talking about currencies and a financial world that we don’t even use. https://pixabay.com/en/block-chain-data-records-concept-2850277/
  • #13: Generality Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • #14: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
  • #17: Resilient Distributed Datasets (RDDs) are the fundamental object used in Apache Spark.  RDDs are immutable collections representing datasets and have the inbuilt capability of reliability and failure recovery. By nature, RDDs create new RDDs upon any operation such as transformation or action. They also store the lineage, which is used to recover from failures.