www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
What Is Apache Spark?
What to expect?
1. Why Apache Spark?
2. Spark Features
3. Spark Ecosystem
4. Hands-On Examples
5. Use Case
Big Data Analytics
Data Generated Every Minute!
➢ Big Data Analytics is the process of examining large data sets to uncover
hidden patterns, unknown correlations, market trends, customer
preferences and other useful business information
➢ Big Data Analytics is of two types:
1. Batch Analytics
2. Real-Time Analytics
Spark For Real Time Analysis
Use Cases For Real Time Analytics: Banking, Government, Healthcare, Telecommunications, Stock Market
Our Requirements:
➢ Process data in real-time
➢ Handle input from multiple sources
➢ Easy to use
➢ Faster processing
What Is Spark?
➢ Apache Spark is an open-source cluster-computing framework for large-scale and near real-time data processing. It was originally developed at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation
➢ Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance
➢ It was designed as a faster alternative to Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, such as interactive queries and stream processing
Figure: Data Parallelism In Spark (serial vs. parallel execution, showing the reduction in time)
Figure: Real Time Processing In Spark
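The implicit data parallelism mentioned above can be sketched in a few lines. This is only an illustration, assuming Spark 2.x or later and a `spark-shell` style session (or any Scala REPL with Spark on the classpath); the application name, the partition count and the numbers are arbitrary choices, not part of the original slides.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("data-parallelism-sketch").getOrCreate()
val sc = spark.sparkContext

// Distribute the numbers 1..100 over 4 partitions (the partition count is an arbitrary choice).
val numbers = sc.parallelize(1 to 100, 4)

// Each partition is squared independently; reduce then merges the per-partition sums.
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)

println(s"partitions = ${numbers.getNumPartitions}, sum of squares = $sumOfSquares")
```

The program never says which worker handles which element: Spark splits the collection into partitions and schedules one task per partition, which is where the parallel speed-up comes from.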
Why Spark?
Features
➢ Speed: Up to 100x faster than Hadoop MapReduce for large-scale data processing
➢ Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities
➢ Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
➢ Polyglot: Can be programmed in Scala, Java, Python and R
Spark Success Story
➢ Twitter Sentiment Analysis With Spark: Trending topics can be used to create campaigns and attract a larger audience. Sentiment helps in crisis management, service adjustment and target marketing
➢ NYSE: Real Time Analysis of Stock Market Data
➢ Banking: Credit Card Fraud Detection
➢ Genomic Sequencing
Using Hadoop Through Spark
Spark And Hadoop
➢ Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework
➢ Spark applications can also be run on YARN (Hadoop NextGen)
➢ MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing
➢ Spark can run on top of HDFS to leverage the distributed replicated storage
Spark Features
Speed
Multiple Languages
Advanced Analytics
Real Time
Hadoop Integration
Machine Learning
➢ Speed: Spark runs up to 100x faster than MapReduce
➢ Supports multiple data sources
➢ Lazy Evaluation: Delays evaluation until it is needed
➢ Real Time: Real-time computation and low latency because of in-memory computation
➢ Hadoop Integration
➢ Machine Learning for iterative tasks
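Lazy evaluation and in-memory caching are easier to see with a counter. The following is a minimal sketch, assuming Spark 2.x or later in a `spark-shell` style session; the accumulator name and the data are illustrative, and the call counts hold for a local run without task retries.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("lazy-evaluation-sketch").getOrCreate()
val sc = spark.sparkContext

// Counts how many times the map function actually runs.
val mapCalls = sc.longAccumulator("mapCalls")

// map is a transformation: it only records what to do. cache() marks the result for in-memory reuse.
val doubled = sc.parallelize(1 to 5).map { x => mapCalls.add(1); x * 2 }.cache()
val callsBeforeAction: Long = mapCalls.value // nothing has been computed yet

// reduce is an action: it triggers the computation and fills the cache.
val total = doubled.reduce(_ + _)
val callsAfterFirstAction: Long = mapCalls.value

// A second action reads the cached partitions instead of re-running map.
val count = doubled.count()
val callsAfterSecondAction: Long = mapCalls.value

println(s"total=$total count=$count calls=$callsBeforeAction/$callsAfterFirstAction/$callsAfterSecondAction")
```

Without the `cache()` call the second action would recompute the map step, so the counter would grow again. Accumulator updates made inside transformations can also repeat when a task is retried, so this counting trick is for illustration rather than for production metrics.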
Spark Components
➢ Spark Core Engine: The core engine for the entire Spark framework. Provides utilities and architecture for other components
➢ Spark SQL (SQL): Used for structured data. Can run unmodified Hive queries on existing Hadoop deployments
➢ Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data
➢ MLlib (Machine Learning): Machine learning libraries being built on top of Spark
➢ GraphX (Graph Computation): Graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
➢ SparkR (R on Spark): Package for the R language to enable R users to leverage Spark power from the R shell
➢ DataFrames: Tabular data abstraction introduced by Spark SQL
➢ ML Pipelines: ML pipelines make it easier to combine multiple algorithms or workflows
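To make the ML Pipelines idea concrete, here is a small sketch that chains three stages into one workflow. It assumes Spark 2.x or later with the `spark.ml` package available; the training rows, column names and parameter values are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ml-pipeline-sketch").getOrCreate()

// Hypothetical toy training set: (id, text, label), where label 1.0 marks Spark-related text.
val training = spark.createDataFrame(Seq(
  (0L, "spark runs fast in memory", 1.0),
  (1L, "the weather is nice today", 0.0),
  (2L, "spark streaming and spark sql", 1.0),
  (3L, "cooking pasta for dinner", 0.0)
)).toDF("id", "text", "label")

// Three stages: split text into words, hash words into feature vectors, fit a classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() runs every stage in order and returns a single reusable model.
val model = pipeline.fit(training)
val scored = model.transform(training).select("id", "text", "prediction")
scored.show(false)
```

The benefit is that the fitted `model` carries the whole workflow, so new text goes through the same tokenizing and hashing steps automatically before it is classified.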
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data
processing
It is responsible for:
➢ Memory management and fault recovery
➢ Scheduling, distributing and monitoring jobs on a cluster
➢ Interacting with storage systems
Figure: Spark Core Job Cluster
Spark Architecture
Figure: Components of a Spark cluster (a Driver Program holding the Spark Context, a Cluster Manager, and Worker Nodes that each run an Executor with a Cache and Tasks)
Spark Streaming
➢ Spark Streaming is used for processing real-time streaming data
➢ It is a useful addition to the core Spark API
➢ Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams
➢ The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data
Figure: Streams In Spark Streaming
Figure: Overview Of Spark Streaming (labels: Streaming Data Sources, Static Data Sources, Spark Streaming, Spark SQL (SQL + DataFrames), MLlib (Machine Learning), Data Storage Systems)
Figure: Data from a variety of sources to various storage systems (sources: Kafka, Flume, HDFS/S3, Kinesis, Twitter; storage systems: HDFS, Databases, Dashboards)
Figure: Incoming streams of data divided into batches (Input Data Stream, Streaming Engine, Batches Of Input Data, Batches Of Processed Data)
Figure: Input data stream divided into discrete chunks of data (a DStream shown as RDD @ Time 1 to RDD @ Time 4, each holding the data from one time interval)
Figure: Extracting words from an InputStream (a flatMap operation maps each RDD of the input DStream to an RDD of the Words DStream)
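The flatMap step in the last figure can be sketched as follows. This is an illustration under several assumptions: Spark 2.x or later with the `spark-streaming` module available, a one-second batch interval, and an in-memory queue of RDDs standing in for a live source such as Kafka or Flume. The sample sentences and variable names are made up.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder().master("local[2]").appName("dstream-flatmap-sketch").getOrCreate()
val sc = spark.sparkContext

// One-second batches: every second of input becomes one RDD of the DStream.
val ssc = new StreamingContext(sc, Seconds(1))

// A queue of RDDs stands in for a live source (handy for experiments, not for production).
val input = mutable.Queue[RDD[String]](sc.parallelize(Seq("spark makes streaming simple", "hello spark")))
val lines = ssc.queueStream(input)

// flatMap turns each line into its words, producing the words DStream.
val words = lines.flatMap(_.split(" "))

// foreachRDD runs on the driver once per batch, so a driver-side buffer works here.
val seen = mutable.ArrayBuffer[String]()
words.foreachRDD { rdd => seen ++= rdd.collect() }

ssc.start()
ssc.awaitTerminationOrTimeout(5000) // let a few batches run
ssc.stop(stopSparkContext = false, stopGracefully = true)

println(seen.mkString(", "))
```

Each operation on a DStream is applied to every underlying RDD in turn, which is why the same `flatMap` used on a static RDD also works on a stream.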
Spark SQL Features
1. Spark SQL integrates relational processing with Spark's functional programming.
2. Spark SQL is used for the structured/semi-structured data analysis in Spark.
3. Support for various data formats
4. SQL queries can be converted into RDDs for transformations (in the slide's figure, invoking RDD 2 computes all partitions of RDD 1)
5. Performance And Scalability
6. Standard JDBC/ODBC Connectivity
7. User Defined Functions let users define new Column-based functions to extend the Spark vocabulary
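Several of these features fit into one short sketch: a DataFrame queried with SQL, a user defined function, and the result flowing back into RDD-style code. It assumes Spark 2.x or later; the rows, the view name `flights` and the function name `toUpper` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("spark-sql-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows: (origin, dest, distance in miles).
val flights = Seq(("sfo", "jfk", 2586), ("lax", "sfo", 337), ("atl", "mia", 594)).toDF("origin", "dest", "dist")

// Register a relational view over the DataFrame so it can be queried with SQL.
flights.createOrReplaceTempView("flights")

// A user defined function adds a new column-based function to the SQL vocabulary.
spark.udf.register("toUpper", (s: String) => s.toUpperCase)

val longHauls = spark.sql(
  "SELECT toUpper(origin) AS origin, toUpper(dest) AS dest, dist FROM flights WHERE dist > 500 ORDER BY dist DESC")
longHauls.show()

// The query result can be handed back to functional, RDD-style code.
val labels = longHauls.rdd.map(row => s"${row.getString(0)}->${row.getString(1)}").collect()
println(labels.mkString(", "))
```

Mixing the two styles is the point of feature 1: the filter and sort are expressed relationally, while the final formatting is an ordinary Scala function.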
Spark SQL Flow Diagram
➢ Spark SQL has the following libraries:
1. Data Source API
2. DataFrame API
3. Interpreter & Optimizer
4. SQL Service
➢ The flow diagram represents a Spark SQL process using all four libraries in sequence
Figure: Spark SQL flow diagram (labels: Data Source API, DataFrame API with Named Columns, Interpreter & Optimizer, Resilient Distributed Dataset, Spark SQL Service)
MLlib
Machine Learning may be broken down into two classes of algorithms:
Supervised
Supervised algorithms use labelled data in which both the input and output are provided to the algorithm
• Classification
- Naïve Bayes
- SVM
• Regression
- Linear
- Logistic
Unsupervised
Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels
• Clustering
- K Means
• Dimensionality Reduction
- Principal Component Analysis
- SVD
MLlib - Techniques
There are 3 common techniques for Machine Learning:
1. Classification: It is a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes
Some common use cases for classification include:
i) Credit card fraud detection
ii) Email spam detection
2. Clustering: In clustering, an algorithm groups objects into categories by analyzing similarities between input examples
3. Collaborative Filtering: Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)
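As a rough illustration of the clustering technique, the sketch below runs K-Means on a handful of points. It assumes Spark 2.x or later with the `spark.ml` package; the points, the choice of k = 2 and the seed are hypothetical.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kmeans-sketch").getOrCreate()

// Hypothetical 2-D points forming two obvious groups, near (0, 0) and near (9, 9).
val points = spark.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 0.0)),
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(9.0, 9.0)),
  (3, Vectors.dense(9.1, 9.1))
)).toDF("id", "features")

// No labels are supplied: the algorithm groups the points purely by similarity.
val model = new KMeans().setK(2).setSeed(1L).fit(points)

// transform() adds a "prediction" column holding the cluster index of each point.
val assigned = model.transform(points).orderBy("id").select("prediction").collect().map(_.getInt(0))
println(assigned.mkString(", "))
```

Which group receives index 0 and which receives index 1 is arbitrary; what matters is that the two nearby pairs end up in different clusters.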
GraphX
Graph Concepts
A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that
connect them. The vertices are the objects and the edges are the relationships between them.
A directed graph is a graph where the edges have a direction associated with them. E.g. User Sam follows John on Twitter.
Figure: Two vertices (Sam and John) joined by an edge (Relationship: Friends); in the directed version the edge is labelled Follows
GraphX – Triplet View
GraphX has a Graph class that contains members to access edges and vertices
Triplet View
The triplet view logically joins the vertex and edge properties yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class
Figure: Vertices (A, B), Edges (A to B) and Triplets (A to B)
GraphX – Property Graph
GraphX is the Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a Resilient Distributed
Property Graph.
The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user defined
properties associated with it. The parallel edges allow multiple relationships between the same vertices.
Figure: Property Graph (vertices LAX and SJC, each with a Vertex Property, connected by an edge with an Edge Property)
GraphX – Example
To understand GraphX, let us consider the below graph.
➢ The vertices have names and ages of people.
➢ The edges represent whether a person likes another person, and each edge weight is a measure of the likeability.
Figure: Six vertices with IDs 1 to 6: Alice (Age: 28), Bob (Age: 27), Charlie (Age: 65), David (Age: 42), Ed (Age: 55), Fran (Age: 50), connected by weighted edges
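The example code in this section refers to `vertexArray` and `edgeArray` without defining them. A possible definition is sketched below, assuming Spark with GraphX on the classpath. The names and ages come from the figure and the edge directions from the "likes" output listed in the slides; the assignment of IDs 1 to 6 is an assumption that matches the order of that output, and the edge weights are illustrative placeholders because the figure shows only some of them.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("graphx-toy-graph").getOrCreate()
val sc = spark.sparkContext

// Vertex IDs 1..6 are an assumed assignment; names and ages come from the figure.
val vertexArray = Array(
  (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
  (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))
)

// Edge(src, dst, weight) means "src likes dst". The weights are illustrative placeholders.
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3),
  Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3)
)

// Named toyGraph so it does not clash with the `graph` value defined in the slides.
val toyGraph: Graph[(String, Int), Int] = Graph(sc.parallelize(vertexArray), sc.parallelize(edgeArray))
println(s"${toyGraph.numVertices} vertices, ${toyGraph.numEdges} edges")
```

With these two arrays in scope, the snippets that build `vertexRDD`, `edgeRDD` and `graph` can be pasted into the same session unchanged.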
Display names and ages
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
//Build the vertex and edge RDDs, then the property graph
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
//Print everyone older than 30
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
  case (id, (name, age)) => println(s"$name is $age")
}
Output
David is 42
Fran is 50
Ed is 55
Charlie is 65
Display Relations
//Each triplet carries the source vertex, the destination vertex and the edge between them
for (triplet <- graph.triplets.collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
Output
Bob likes Alice
Bob likes David
Charlie likes Bob
Charlie likes Fran
David likes Alice
Ed likes Bob
Ed likes Charlie
Ed likes Fran
Use Case: Analyze Flight Data Using Spark GraphX
Use Case: Problem Statement
To analyse real-time flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio
Computations to be done:
➢ Compute the total number of flight routes
➢ Compute and sort the longest flight routes
➢ Display the airport with the highest vertex degree
➢ List the most important airports according to PageRank
➢ List the routes with the lowest flight costs
We will use Spark GraphX for the above computations and visualize the results using Google Data Studio
Use Case: Flight Dataset
The attributes of each row are as below:
1. Day Of Month
2. Day Of Week
3. Carrier Code
4. Unique ID- Tail Number
5. Flight Number
6. Origin Airport ID
7. Origin Airport Code
8. Destination Airport ID
9. Destination Airport Code
10. Scheduled Departure Time
11. Actual Departure Time
12. Departure Delay In Minutes
13. Scheduled Arrival Time
14. Actual Arrival Time
15. Arrival Delay Minutes
16. Elapsed Time
17. Distance
Figure: USA Airport Flight Data
Use Case: Flow Diagram
1. Huge amount of Flight data
2. Database storing Real-Time Flight Data
3. Creating Graph Using GraphX
4. Queries on the USA Flight Mapping:
➢ Calculate Top Busiest Airports
➢ Compute Longest Flight Routes
➢ Calculate Routes with Lowest Flight Costs
5. Visualizing using Google Data Studio
Use Case: Starting Spark Shell
//Importing the necessary classes
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators
//Creating a Case Class ‘Flight’
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)
//Defining a ‘parseFlight’ function to parse an input line into the ‘Flight’ class
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong, line(6),
    line(7).toLong, line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}
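The parser above throws an exception on a header line, a truncated row or a non-numeric field. A defensive variant is sketched below in plain Scala; the name `parseFlightSafe` and both sample rows are hypothetical and are not taken from the real dataset.

```scala
import scala.util.Try

case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

// Returns None instead of throwing when a row is a header, is truncated, or has a non-numeric field.
def parseFlightSafe(str: String): Option[Flight] = Try {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong, line(6),
    line(7).toLong, line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}.toOption

// Hypothetical sample rows, made up for this sketch.
val goodRow = "1,3,AA,N338AA,1,12478,JFK,12892,LAX,900,855,0,1225,1237,12,385,2475"
val badRow = "DayOfMonth,DayOfWeek,Carrier"

val parsedGood = parseFlightSafe(goodRow)
val parsedBad = parseFlightSafe(badRow)
println(parsedGood.map(f => s"${f.origin} to ${f.dest}, ${f.dist} miles"))
println(parsedBad)
```

On an RDD of lines, something like `textRDD.flatMap(line => parseFlightSafe(line).toList)` would keep only the rows that parse, at the cost of silently dropping the rest.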
Use Case: Creating Edges For Graph Mapping
//Load the data into an RDD ‘textRDD’
val textRDD = sc.textFile("/home/edureka/Downloads/AirportDataset.csv")
//Parse the RDD of CSV lines into an RDD of Flight objects
val flightsRDD = textRDD.map(parseFlight).cache()
//Create airports RDD with ID and Name
val airports = flightsRDD.map(flight => (flight.org_id, flight.origin)).distinct
airports.take(1)
//Define a default vertex called ‘nowhere’ and map each Airport ID to its name for printlns
val nowhere = "nowhere"
val airportMap = airports.map { case ((org_id), name) => (org_id -> name) }.collect.toList.toMap
//Create routes RDD with sourceID, destinationID and distance
val routes = flightsRDD.map(flight => ((flight.org_id, flight.dest_id), flight.dist)).distinct
routes.take(2)
//Create edges RDD with sourceID, destinationID and distance
val edges = routes.map { case ((org_id, dest_id), distance) => Edge(org_id.toLong, dest_id.toLong, distance) }
edges.take(1)
Use Case: Routes & Edge Triplets
//Define the graph and display some vertices and edges
val graph = Graph(airports, edges, nowhere)
graph.vertices.take(2)
graph.edges.take(2)
//Find the number of airports
val numairports = graph.numVertices
//Calculate the total number of routes
val numroutes = graph.numEdges
//Find routes with distances of more than 1000 miles
graph.edges.filter { case Edge(org_id, dest_id, distance) => distance > 1000 }.take(3)
//Display some edge triplets
graph.triplets.take(3).foreach(println)
//Sort and print the longest routes
graph.triplets.sortBy(_.attr, ascending = false).map(triplet =>
  "Distance " + triplet.attr.toString + " from " + triplet.srcAttr + " to " + triplet.dstAttr + ".").take(10).foreach(println)
Use Case: Vertex Degree Computation
//Define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
//Display highest degree vertices for incoming and outgoing routes of airports
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = graph.degrees.reduce(max)
//Get the airport name with IDs 10397 and 12478
airportMap(10397)
airportMap(12478)
//Find the airports with the most incoming routes
val maxIncoming = graph.inDegrees.collect.sortWith(_._2 > _._2).map(x => (airportMap(x._1), x._2)).take(3)
maxIncoming.foreach(println)
//Find the airports with the most outgoing routes
val maxout = graph.outDegrees.join(airports).sortBy(_._2._1, ascending = false).take(3)
maxout.foreach(println)
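The degree computation can be checked by hand on a tiny graph. The sketch below assumes Spark with GraphX on the classpath; the route list and vertex IDs are made up, and `maxDegree` is simply a renamed copy of the reduce function above.

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("degree-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical routes as (origin ID, destination ID) pairs; the IDs are made up.
val routePairs = sc.parallelize(Seq((1L, 2L), (1L, 3L), (1L, 4L), (2L, 3L), (4L, 3L), (3L, 1L), (3L, 2L)))
val g = Graph.fromEdgeTuples(routePairs, "airport")

// Keeps whichever (vertex, degree) pair has the larger degree.
def maxDegree(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = if (a._2 > b._2) a else b

val busiestDeparture = g.outDegrees.reduce(maxDegree) // most outgoing routes
val busiestArrival = g.inDegrees.reduce(maxDegree)    // most incoming routes
val busiestOverall = g.degrees.reduce(maxDegree)      // incoming plus outgoing

println(s"out=$busiestDeparture in=$busiestArrival total=$busiestOverall")
```

Vertex 1 has three outgoing routes and vertex 3 has three incoming ones, so the three reductions pick different winners. Vertices with no incoming edges do not appear in `inDegrees` at all (and likewise for `outDegrees`), which matters when joining degree results back to the full airport list.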
Use Case: Graph PageRank
//Find the most important airports according to PageRank (0.1 is the convergence tolerance)
val ranks = graph.pageRank(0.1).vertices
val temp = ranks.join(airports)
temp.take(1)
//Sort the airports by ranking, highest first
val temp2 = temp.sortBy(_._2._1, false)
temp2.take(2)
//Display the most important airports
val impAirports = temp2.map(_._2._2)
impAirports.take(4)
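The same PageRank steps can be tried on a graph small enough to reason about. This sketch assumes Spark with GraphX on the classpath; the four vertices, their routes and the airport codes (`HUB`, `AAA`, `BBB`, `CCC`) are invented for illustration.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pagerank-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical routes: airports 2, 3 and 4 all fly to airport 1, which flies back to 2 only.
val routePairs = sc.parallelize(Seq((2L, 1L), (3L, 1L), (4L, 1L), (1L, 2L)))
val g = Graph.fromEdgeTuples(routePairs, 1)

// pageRank(tol) keeps iterating until the ranks change by less than tol.
val ranks = g.pageRank(0.001).vertices

// Attach made-up airport codes and sort by descending rank.
val names = sc.parallelize(Seq((1L, "HUB"), (2L, "AAA"), (3L, "BBB"), (4L, "CCC")))
val ranked = ranks.join(names).sortBy(_._2._1, ascending = false)
  .map { case (_, (rank, name)) => (name, rank) }.collect()

ranked.foreach(println)
```

Airport 1 comes out on top because every other airport points to it. A smaller tolerance gives more precise ranks at the cost of more iterations; GraphX also offers `staticPageRank(numIter)` for a fixed number of iterations.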
Use Case: Pregel Computation
//Implementing Pregel
val sourceId: VertexId = 13024
//Edge cost model: a fixed 50 plus the distance divided by 20
val gg = graph.mapEdges(e => 50.toDouble + e.attr.toDouble / 20)
val initialGraph = gg.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)
)
//Find the Routes with the lowest flight costs
println(sssp.edges.take(4).mkString("\n"))
//Find airports and their lowest flight costs
println(sssp.vertices.take(4).mkString("\n"))
//Display airport codes along with sorted lowest flight costs
sssp.vertices.collect.map(x => (airportMap(x._1), x._2)).sortWith(_._2 < _._2)
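The three functions passed to `pregel` are easier to follow on a graph where the answer is known in advance. The sketch below repeats the same single-source shortest-path pattern, assuming Spark with GraphX on the classpath; the four vertices and the edge costs are invented.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pregel-sssp-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical routes with made-up costs: 1->2 (1.0), 2->3 (2.0), 1->3 (5.0), 3->4 (1.0).
val costEdges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
val source: VertexId = 1L

// Every vertex starts at infinity except the source, which starts at cost 0.
val start = Graph.fromEdges(costEdges, 0.0)
  .mapVertices((id, _) => if (id == source) 0.0 else Double.PositiveInfinity)

val cheapest = start.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the cheaper of the current cost and the incoming offer.
  (id, cost, offer) => math.min(cost, offer),
  // Send message: offer the destination a cheaper path through this edge, if there is one.
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  // Merge messages: when several offers arrive at one vertex, keep the cheapest.
  (a, b) => math.min(a, b)
)

val costs = cheapest.vertices.collect().toMap
println(costs.toSeq.sortBy(_._1).mkString(", "))
```

The direct edge from 1 to 3 costs 5.0, but the path through vertex 2 costs only 3.0, so vertex 3 settles on 3.0 and vertex 4 on 4.0. The computation stops on its own once no vertex receives a cheaper offer.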
Use Case – Visualizing Results
➢ We will be using Google Data Studio to visualize our analysis
➢ Google Data Studio is a product under Google Analytics 360 Suite
➢ The image shows a sample marketing website summary using Geo Map, Time Series and Bar Chart
➢ We will use the Geo Map service to map the airports to their respective locations on the USA map and display the metrics quantity
Figure: Google Data Studio
1. Display the total number of flights per Airport
Figure: Total Number of Flights from New York
2. Display the metric sum of Destination routes from every Airport
Figure: Measure of the total outgoing traffic from Los Angeles
3. Display the total delay of all flights per Airport
Figure: Total Delay of all Flights at Atlanta
Conclusion
Congrats!
We have hence demonstrated the power of Spark in Real Time Data Analytics.
The hands-on examples will give you the required confidence to work on any
future projects you encounter in Apache Spark.
Thank You …
Questions/Queries/Feedback

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Training | Edureka