Analyzing	Flight	Delays	with	Apache	Spark	
GraphFrames	and	MapR-DB
2 © 2018 MapR Technologies, Inc. // MapR Confidential
Agenda	
•  Introduction	to	Graphs		
•  Introduction	to	GraphFrames	with	a	simple	Flight	Dataset	
•  Use	GraphFrames	with	Flight	Dataset	for	2018	
Intro	to	Graphs
•  Graph:	Models	Relations	between	Objects	
•  Graph:	Vertices	connected	by	Edges	
•  Vertices:	the	objects	
•  Edges:	the	relationships	between	Vertices	
What	is	a	Graph?
Regular graph: each vertex has the same number of edges
Example: Facebook friends (the relationship is undirected, it holds both ways)
– Ted is a friend of Carol
– Carol is a friend of Ted
Regular	Graphs	vs	Directed	Graphs
Directed	graph:	edges	have	a	direction	
Example:	Twitter	followers	
– Carol	follows	Oprah	
– Oprah	does	not	follow	Carol	
Regular	Graphs	vs	Directed	Graphs
Property Graph:
•  Edges and vertices have properties
•  A vertex can have multiple directed edges in parallel
•  Allows multiple relationships
Spark GraphX supports a distributed property graph.
Property Graph
[Figure: vertex properties: City, State; edge properties: Flight number, Distance, Delay]
What	is	GraphX?	
Spark SQL
•  Structured Data
•  Querying with
SQL/HQL
•  DataFrames
Spark Streaming
•  Processing of live
streams
•  Micro-batching
MLlib
•  Machine Learning
•  Multiple types of
ML algorithms
GraphX
•  Graph processing
•  Graph parallel
computations
•  Task scheduling
•  Memory management
•  Fault recovery
•  Interacting with storage systems
Spark Core
Graph	Algorithms	and		Graph	
Queries	with	GraphFrames
Web Sites
•  Vertices = Web Pages
•  Edges = Links between Pages
•  PageRank importance: computed iteratively from the number of links to a page and the rank of its linking pages
•  Twitter example: who has the most Twitter followers
Graph Algorithms: PageRank
[Figure: vertex = web page, edge = link; importance depends on the number and rank of linking pages]
Visualize PageRank
1.  Each page sends a message with its “rank” to its neighbors
Graph Algorithms: PageRank
[Figure: five vertices, each with rank 0.20; a message function is sent from each vertex]
Visualize PageRank
1.  Each page sends a message with its “rank” to its neighbors
2.  Messages are aggregated and calculated at each destination vertex
3.  The sum of the messages becomes the new vertex PageRank
4.  Repeat
Graph Algorithms: PageRank
[Figure: messages aggregated and calculated at each vertex]
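The four steps above can be sketched with plain Scala collections. This is a minimal illustration, not the GraphX or GraphFrames implementation: the object name, the toy edge list (including the extra SFO to DFW edge) and the damping formula `resetProb + (1 - resetProb) * sum` are assumptions chosen for the example.

```scala
object PageRankSketch {
  // Hypothetical toy graph: directed edges as (source, destination) pairs.
  val edges: Seq[(String, String)] =
    Seq("SFO" -> "ORD", "SFO" -> "DFW", "ORD" -> "DFW", "DFW" -> "SFO")

  val vertices: Seq[String] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct

  // One round: each vertex sends rank / outDegree along its outgoing edges (step 1),
  // messages are summed at each destination (steps 2 and 3), and damping is applied.
  def step(ranks: Map[String, Double], resetProb: Double): Map[String, Double] = {
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    val messages: Seq[(String, Double)] =
      edges.map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
    val sums: Map[String, Double] =
      messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
    vertices.map(v => v -> (resetProb + (1 - resetProb) * sums.getOrElse(v, 0.0))).toMap
  }

  // Step 4: repeat for a fixed number of iterations, starting every vertex at rank 1.0.
  def run(maxIter: Int, resetProb: Double = 0.15): Map[String, Double] = {
    val init: Map[String, Double] = vertices.map(_ -> 1.0).toMap
    (1 to maxIter).foldLeft(init)((ranks, _) => step(ranks, resetProb))
  }
}
```

Calling `PageRankSketch.run(10)` returns one rank per airport code. In this toy graph DFW receives messages from both SFO and ORD, so it ends up ranked above ORD.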
•  Many	Graph	Algorithms	Aggregate	properties	of	neighbors:	
•  PageRank	
•  Connected	Components	
•  Shortest	Path	
Graph	Algorithms	
[Figures: Connected Components; Shortest Path A to F. Reference: https://en.wikipedia.org/]
Graph Motif: recurrent patterns in a graph
Graph Motif Query: Search a graph for occurrences of a given pattern
Twitter Example:
Who should we recommend for Carol to follow?
•  Carol follows Oprah
•  Oprah follows Reese Witherspoon
•  Recommend Carol to follow Reese
Graph Motif Queries
[Figure: Carol follows Oprah; Oprah follows Reese Witherspoon; recommend Reese Witherspoon to Carol?]
Graph Motif: recurrent patterns in a graph
Graph Motif Query: Search a graph for occurrences of a given pattern
Twitter Example:
Recommend who to follow? Search for patterns
•  A follows B
•  B follows C
•  A does not follow C
Graph Motif Queries
[Figure: A follows B; B follows C; recommend C to A]
Twitter: A follows B; B follows C; A doesn’t follow C
graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
Graph Query: Motif Find Structural Pattern
Search for a pattern:
•  Vertex ( ), Edge [ ]
•  (a)-[]->(b): a follows b
•  (b)-[]->(c): b follows c
•  !(a)-[]->(c): a doesn’t follow c
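What the motif pattern matches can be illustrated with plain Scala collections. This is a sketch under assumptions, not how GraphFrames evaluates `find`: the object and method names are hypothetical, and the `a != c` check is an extra filter that the pattern itself does not include.

```scala
object MotifSketch {
  // Follow relationships from the Twitter example: (follower, followed).
  val follows: Set[(String, String)] =
    Set("Carol" -> "Oprah", "Oprah" -> "Reese")

  // Find every (a, b, c) where a follows b, b follows c, and a does not follow c.
  def recommendations(edges: Set[(String, String)]): Set[(String, String, String)] =
    for {
      (a, b)  <- edges
      (b2, c) <- edges
      if b == b2 && a != c && !edges.contains((a, c))
    } yield (a, b, c)
}
```

`MotifSketch.recommendations(MotifSketch.follows)` yields the single triple (Carol, Oprah, Reese), which is the recommendation described on the earlier slide.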
Separate	Systems	
Image	reference	Spark	Summit
GraphFrames:	Graph	Algorithms	+	Graph	Queries	
Image	reference	Spark	Summit
Graph	Examples
Twitter tweets:
morally outraged tweets are retweeted within a political sphere but rarely outside that sphere
Real	World	Graphs:	Twitter	
Reference	National	Academy	of	Sciences
Recommendation	Engine:	
•  Vertices	=	Users,	Products	
•  Edges	=	Ratings	or	Purchases	
•  Calculate	how	similar	users	rated	
similar	products	
	
Graph:	Recommendation	Engines
Healthcare	Fraud:	
•  Vertices	=	Doctors,	Patients,	
Prescriptions	
•  Edges	=	prescribed	
•  Calculate	Narcotic	Abuse,	Patient	
Similarity,	Over	prescribing	
	
Real	World	Graphs:	Fraud	
[Figure: edges labeled "Prescribed"]
Credit Card Application Fraud:
•  Vertices	=	Credit	Card	Applicant,	
Phone,	email,	address,	ssn	
•  Edges	=	Identifier	
•  Detect	People	sharing	identifiers	such	
as	telephone	number	
	
Real	World	Graphs:	Fraud	
[Figure: shared identifier: phone number]
Image reference Capital One at Spark Summit
A	Simple	Flight	Example	with	GraphFrames
Simple Flight Example with GraphFrames
Originating Airport   Destination Airport   Distance     Delay
SFO                   ORD                   1800 miles   40
ORD                   DFW                   800 miles    0
DFW                   SFO                   1400 miles   10
Vertex	Table
Edges	Table
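// Encoders for createDataset come from spark.implicits._ (imported automatically in spark-shell)
import spark.implicits._
case class Airport(id: String, city: String)
val airports = Array(Airport("SFO","San Francisco"),
Airport("ORD","Chicago"), Airport("DFW","Dallas Fort Worth"))
 
val vertices = spark.createDataset(airports).toDF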
vertices.show
+---+-----------------+
| id| city|
+---+-----------------+
|SFO| San Francisco|
|ORD| Chicago|
|DFW|Dallas Fort Worth|
+---+-----------------+
Create	a	Vertices	DataFrame	
case class Flight(id: String, src: String, dst: String,
dist: Double, delay: Double)
val flights=Array(
Flight("SFO_ORD_2017-01-01_AA","SFO","ORD",1800, 40),
Flight("ORD_DFW_2017-01-01_UA","ORD","DFW",800, 0),
Flight("DFW_SFO_2017-01-01_DL","DFW","SFO",1400, 10))
val edges = spark.createDataset(flights).toDF
edges.show
+--------------------+---+---+------+-----+
| id|src|dst| dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|ORD_DFW_2017-01-0...|ORD|DFW| 800.0| 0.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+
Create	an	Edges	DataFrame
import org.graphframes.GraphFrame
val graph = GraphFrame(vertices, edges)
graph.vertices.show
 
+---+-----------------+
| id| city|
+---+-----------------+
|SFO| San Francisco|
|ORD| Chicago|
|DFW|Dallas Fort Worth|
+---+-----------------+
Create	the	GraphFrame
graph.edges.show
 
result:
+--------------------+---+---+------+-----+
| id|src|dst| dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|ORD_DFW_2017-01-0...|ORD|DFW| 800.0| 0.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+
GraphFrame	Edges
To	answer	questions	such	as:	
How	many	airports	are	there?	
How	many	flight	routes	are	there?	
What	are	the	longest	distance	routes?	
Which	airport	has	the	most	incoming	flights?	
What	are	the	top	10	flights?	
	
	
Graph	Operators
// How many airports?
graph.vertices.count
 
result: = 3
// How many flights?
graph.edges.count
 
result: = 3
Query	the	GraphFrame
// flight routes > 800 miles distance?
graph.edges.filter("dist > 800").show
+--------------------+---+---+------+-----+
| id|src|dst| dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+
 
Query	the	GraphFrame
Loading	and	Exploring	the	MapR-DB	Flight	
Table	with	DataFrames
How	a	Spark	Application	Runs	on	a	Cluster
•  A	Dataset	is	a	collection	of	Typed	Objects	
•  Dataset[T]		
•  (can	use	SQL	and	functions)	
•  A	DataFrame	is	a	Dataset	of	Row	objects			
•  Dataset[Row]	
•  (can	use	SQL)	
•  Partitioned	across	a	cluster	
•  Operated	on	in	parallel	
•  can	be	Cached	
	
Spark	Distributed	Datasets	
partitioned
•  Spark SQL queries and updates to MapR-DB
•  With projection and filter pushdown, custom partitioning, and data locality
	
Spark	SQL	Querying	MapR-DB	JSON
Designed	for	Partitioning	and	Scaling	
Data is automatically partitioned
and sorted by id row key!
Spark	MapR-DB	Connector
{
"id": "ATL_LGA_2017-01-01_AA_1678",
"dofW": 7,
"carrier": "AA",
"src": "ATL",
"dst": "LGA",
"crsdephour": 17,
"crsdeptime": 1700,
"depdelay": 0.0,
"crsarrtime": 1912,
"arrdelay": 0.0,
"crselapsedtime": 132.0,
"dist": 762.0
}
Flight Dataset
Table is automatically partitioned
and sorted by id row key!
MapR-DB	JSON	Document	Store	
Data is automatically partitioned
and sorted by id row key!
{
"id": "ATL_LGA_2017-01-01_AA_1678",
"dofW": 7,
"carrier": "AA",
"src": "ATL",
"dst": "LGA",
"crsdephour": 17,
"crsdeptime": 1700,
"depdelay": 0.0,
"crsarrtime": 1912,
"arrdelay": 0.0,
"crselapsedtime": 132.0,
"dist": 762.0
}
Row key: the table is partitioned by the src, dst vertices
Data is automatically partitioned by key range and sorted by row key, which starts with src_dst
Example row key: ATL_LGA_2017-01-01_AA_1678
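Why a row key that starts with src_dst keeps flights on the same route next to each other can be sketched in plain Scala. This is an illustration only: the object, the `rowKey` helper, its parameter names and the ORD_DFW flight number are assumptions, and an in-memory string sort stands in for MapR-DB's key-range ordering.

```scala
object RowKeySketch {
  // Compose an id in the shape shown on the slide: src_dst_date_carrier_number.
  // The parameter names are assumptions; the slide only shows the finished key.
  def rowKey(src: String, dst: String, date: String, carrier: String, flightNum: Int): String =
    s"${src}_${dst}_${date}_${carrier}_$flightNum"

  val keys: Seq[String] = Seq(
    rowKey("ORD", "DFW", "2017-01-01", "UA", 100), // hypothetical flight number
    rowKey("ATL", "LGA", "2017-01-01", "AA", 1678),
    rowKey("ATL", "BOS", "2018-01-01", "AA", 1678)
  )

  // Sorting by key groups flights by origin first, then by destination.
  val sorted: Seq[String] = keys.sorted

  // With sorted keys, all flights on one route share a common prefix.
  def withPrefix(prefix: String): Seq[String] = sorted.filter(_.startsWith(prefix))
}
```

For example, `RowKeySketch.withPrefix("ATL_")` returns both Atlanta departures, and `RowKeySketch.withPrefix("ATL_LGA_")` narrows that to the single Atlanta to LaGuardia flight.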
SFO
DEN
IAH
ATL
ORD
BOS
LGA
EWR
MIA
SEA	
LAX	
DFW	
Airports
Load	the	data	into	a	Dataset:	Define	the	Schema
val tableName = "/user/mapr/flighttable"
val df = spark.sparkSession
.loadFromMapRDB[Flight](tableName, schema)
Read	Dataset	from	MapR-DB	
[Figure: the driver sends tasks to the workers; each worker task processes and caches its share of the data (Cache 1, Cache 2, Cache 3)]
df.show(5)
Show	the	first	rows	of	the	DataFrame	
[Figure: DataFrame output with its columns and a row labeled]
Data is automatically partitioned and sorted by row key, which starts with src_dst, e.g. ATL_BOS_2018-01-01_AA_1678
import org.apache.spark.sql.functions._
df.filter($"depdelay" > 40).groupBy("src")
.count().orderBy(desc("count")).show(5)
+---+-----+
|src|count|
+---+-----+
|ORD| 4033|
|ATL| 3106|
|DFW| 2782|
|EWR| 2328|
|DEN| 2304|
+---+-----+
Originating	airports	with	highest	number	of	Departure	Delays
df.filter($"depdelay" > 40).groupBy("src")
.count.orderBy(desc("count" )).explain
== Physical Plan ==
*(3) Sort [count#549L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#549L DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[src#5], functions=[count(1)])
      +- Exchange hashpartitioning(src#5, 200)
         +- *(1) HashAggregate(keys=[src#5], functions=[partial_count(1)])
            +- *(1) Project [src#5]
               +- *(1) Filter (isnotnull(depdelay#9) && (depdelay#9 > 40.0))
                  +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,depdelay#9] PushedFilters: [IsNotNull(depdelay), GreaterThan(depdelay,40.0)], ReadSchema: struct<src:string,depdelay:double>
MapR-DB	Projection	and	Filter	push	down	
Project and Filter pushed into
MapR-DB!
Spark	MapR-DB		Projection	Filter	push	down	
Projection and Filter pushdown reduces the
amount of data passed between MapR-DB
and the Spark engine when selecting and
filtering data.
	
Data is selected and filtered in
MapR-DB!
df.cache
df.count()
Long = 282628
df.createOrReplaceTempView("flights")
Register DataFrame as a Temporary View
%sql select carrier, avg(depdelay) from flights
group by carrier
Average	Departure	Delay	by	Carrier
%sql select src, count(depdelay) from flights
where depdelay > 40 group by src
Count	of		Departure	Delays	by	Origin
%sql select src, dst, count(depdelay) from flights
where depdelay > 40 group by src,dst
Count	of		Departure	Delays	by	Origin,	Destination
Explore	MapR-DB	Flight	Table	with	
GraphFrames
To	answer	questions	such	as:	
How	many	flight	routes	are	there?	
What	are	the	longest	distance	routes?	
Which	airport	has	the	most	incoming	flights?	
What	are	the	top	10	flight	routes?	
	
	
GraphFrame	and	DataFrame
val airports = spark.read.json(file)
airports.show
+-------------+-------+-----+---+
| City|Country|State| id|
+-------------+-------+-----+---+
| Chicago| USA| IL|ORD|
| New York| USA| NY|JFK|
| New York| USA| NY|LGA|
| Boston| USA| MA|BOS|
| Houston| USA| TX|IAH|
| Newark| USA| NJ|EWR|
| Denver| USA| CO|DEN|
| Miami| USA| FL|MIA|
|San Francisco| USA| CA|SFO|
| Atlanta| USA| GA|ATL|
| Dallas| USA| TX|DFW|
| Charlotte| USA| NC|CLT|
| Los Angeles| USA| CA|LAX|
| Seattle| USA| WA|SEA|
+-------------+-------+-----+---+
Read	Vertices	DataFrame	from	a	JSON	File
val graph = GraphFrame(airports, df)
// graph.edges is a DataFrame
graph.edges.show
 
Create	the	GraphFrame
GraphFrame API
Category            Methods
Graph Topology      vertices, edges, triplets
Graph Structure     inDegrees, outDegrees, degrees
Graph Algorithms    pageRank, bfs, aggregateMessages, shortestPaths, connectedComponents, triangleCount
Graph Queries       Motif find
DataFrame Queries
Operation                          Description
select(col)                        Selects set of columns
sort(sortcol)                      Returns new DataFrame sorted by specified column
filter(expr); where(condition)     Filter based on the SQL expression or condition
groupBy(cols: Columns)             Groups DataFrame using specified columns
join(DataFrame, joinExpr)          Joins with another DataFrame using given join expression
count                              Count of rows
avg, count, min, max, sum (col)    Average, count, min, max, sum on values in a group
graph.vertices.filter("State='TX'").show
+-------+-------+-----+---+
| City|Country|State| id|
+-------+-------+-----+---+
|Houston| USA| TX|IAH|
| Dallas| USA| TX|DFW|
+-------+-------+-----+---+
Graph	Vertices	and	Edges	are	DataFrames
// How many airports?
graph.vertices.count
 
result: = 13
// How many flights?
graph.edges.count
 
result: = 282628
GraphFrame	DataFrame	Queries
// Show the longest distance flight routes
graph.edges.groupBy("src", "dst")
.max("dist").sort(desc("max(dist)")).show(4)
+---+---+---------+
|src|dst|max(dist)|
+---+---+---------+
|MIA|SEA| 2724.0|
|SEA|MIA| 2724.0|
|BOS|SFO| 2704.0|
|SFO|BOS| 2704.0|
+---+---+---------+ 
What	are	the	4	Longest	Distance	Flights?
graph.edges.filter("src = 'ATL' and depdelay > 1")
.groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).show
+---+---+------------------+
|src|dst| avg(depdelay)|
+---+---+------------------+
|ATL|EWR| 58.1085801063022|
|ATL|ORD| 46.42393736017897|
|ATL|DFW|39.454460966542754|
|ATL|LGA| 39.25498489425982|
|ATL|CLT| 37.56777108433735|
|ATL|SFO| 36.83008356545961|
+---+---+------------------+
What	is	the	average	delay	for	delayed	flights	from	Atlanta?
graph.edges.filter("src = 'ATL' and depdelay > 1")
.groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).explain
== Physical Plan ==
*(3) Sort [avg(depdelay)#273 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(avg(depdelay)#273 DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[src#5, dst#6], functions=[avg(depdelay#9)])
      +- Exchange hashpartitioning(src#5, dst#6, 200)
         +- *(1) HashAggregate(keys=[src#5, dst#6], functions=[partial_avg(depdelay#9)])
            +- *(1) Filter (((isnotnull(src#5) && isnotnull(depdelay#9)) && (src#5 = ATL)) && (depdelay#9 > 1.0))
               +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,dst#6,depdelay#9] PushedFilters: [IsNotNull(src), IsNotNull(depdelay), EqualTo(src,ATL), GreaterThan(depdelay,1.0)], ReadSchema: struct<src:string,dst:string,depdelay:double>
MapR-DB	Projection	and	Filter	push	down
z.show( graph.edges
.filter("src = 'ATL' and depdelay > 1")
.groupBy("crsdephour")
.avg("depdelay") )
What	is	the	Average	Delay	for	delayed	flights	from	Atlanta	by	
Hour?
What are the highest degree vertices?
z.show( graph.degrees.orderBy(desc("degree")) )
Which	Airports	have	the	most	incoming	and	outgoing	flights?
val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.orderBy($"pagerank".desc).show(5)
+-------------+-------+-----+---+-------------------+
| City|Country|State| id| pagerank|
+-------------+-------+-----+---+-------------------+
| Chicago| USA| IL|ORD| 1.5129929839358685|
| Atlanta| USA| GA|ATL| 1.4255481544216664|
| Los Angeles| USA| CA|LAX| 1.2787001001758738|
| Dallas| USA| TX|DFW| 1.1999252171688064|
| Denver| USA| CO|DEN| 1.1275194324360767|
+-------------+-------+-----+---+-------------------+
Use	Pagerank	to	find	most	important	airports
import org.graphframes.lib.AggregateMessages
val AM = AggregateMessages
val msgToSrc = AM.edge("depdelay")
val agg = { graph
.aggregateMessages
.sendToSrc(msgToSrc)
.agg(avg(AM.msg).as("avgdelay"))}
agg.show()
+---+------------------+
| id| avgdelay|
+---+------------------+
|EWR|17.818079459546404|
|MIA|17.768691978431264|
|ORD| 16.5199551010227|
+---+------------------+
Aggregate	Messages	to	calculate	avg	delay
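The send-then-aggregate idea behind the query above can be sketched with plain Scala collections. This is not the GraphFrames `aggregateMessages` implementation: the object name and the extra SFO to ORD flight are assumptions added so that one airport receives more than one message.

```scala
object AggregateMessagesSketch {
  // Edges from the simple flight example: (src, dst, delay).
  val flights: Seq[(String, String, Double)] = Seq(
    ("SFO", "ORD", 40.0),
    ("ORD", "DFW", 0.0),
    ("DFW", "SFO", 10.0),
    ("SFO", "ORD", 20.0) // hypothetical extra flight, not in the slides
  )

  // Each edge sends its delay to its source vertex (like sendToSrc);
  // each vertex then averages the messages it received (like agg(avg(...))).
  val avgDelayBySrc: Map[String, Double] =
    flights
      .map { case (src, _, delay) => src -> delay }
      .groupBy(_._1)
      .map { case (src, msgs) => src -> msgs.map(_._2).sum / msgs.size }
}
```

Here SFO receives two messages (40.0 and 20.0), so `AggregateMessagesSketch.avgDelayBySrc("SFO")` is 30.0, while ORD and DFW each average a single message.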
// count of flight routes
val flightroutecount=graph.edges
.groupBy("src", "dst")
.count().orderBy(desc("count"))
flightroutecount.show(5)
+---+---+-----+
|src|dst|count|
+---+---+-----+
|LGA|ORD| 4442|
|ORD|LGA| 4426|
|LAX|SFO| 4406|
|SFO|LAX| 4354|
|ATL|LGA| 3884|
+---+---+-----+
// how many routes?
flightroutecount.count
Long = 148
What	are	the	most	Frequent	Flight	Routes?
(Highest count of flights)
z.show(flightroutecount)
What	are	the	most	Frequent	Flight	Routes?
graph.triplets
.show(3)
+--------------------+--------------------+--------------------+
| src| edge| dst|
+--------------------+--------------------+--------------------+
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
+--------------------+--------------------+--------------------+
Triplets = 2 Vertices and 1 Connecting Edge, as a DataFrame
[Figure: src vertex, edge, dst vertex]
graph.triplets
.filter("src.State='TX'")
.show
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
|src |edge |dst |
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1473, 2018-01-01, 1, 1, AA, DFW, ATL, 10, 1026, 26.0, 1327, 21.0, 121.0, 731.0]|[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1675, 2018-01-01, 1, 1, AA, DFW, ATL, 13, 1255, 32.0, 1557, 16.0, 122.0, 731.0]|[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2408, 2018-01-01, 1, 1, AA, DFW, ATL, 18, 1835, 4.0, 2141, 0.0, 126.0, 731.0] |[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2479, 2018-01-01, 1, 1, AA, DFW, ATL, 9, 855, 0.0, 1200, 0.0, 125.0, 731.0] |[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2497, 2018-01-01, 1, 1, AA, DFW, ATL, 21, 2055, 0.0, 2359, 0.0, 124.0, 731.0] |[Atlanta, USA, GA, ATL]|
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
Triplets = 2 Vertices and 1 Connecting Edge, as a DataFrame
[Figure: src vertex, edge, dst vertex]
DataFrames: refine the result
graph.find("(src)-[edge]->(dst)")
.show(3)
+--------------------+--------------------+--------------------+
| src| edge| dst|
+--------------------+--------------------+--------------------+
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
+--------------------+--------------------+--------------------+
Motif find: search for a pattern
[Figure: src vertex, edge, dst vertex]
Next:	use	flightroutecount	with	Motif	find
val subGraph = GraphFrame(graph.vertices, flightroutecount)
val res = subGraph
.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
.filter("c.id != a.id")
Motif Find Flights with No Direct Connection
Search for a pattern: (a)-[]->(b); (b)-[]->(c); !(a)-[]->(c), where ( ) is a vertex and [ ] is an edge
DataFrames: refine the result, remove duplicates
val results = graph.shortestPaths.landmarks(Seq("LGA")).run()
results.select("id", "distances").show()
+---+----------+
| id| distances|
+---+----------+
|IAH|[LGA -> 1]|
|CLT|[LGA -> 1]|
|LAX|[LGA -> 2]|
|DEN|[LGA -> 1]|
|DFW|[LGA -> 1]|
|SFO|[LGA -> 2]|
|LGA|[LGA -> 0]|
|ORD|[LGA -> 1]|
|MIA|[LGA -> 1]|
|SEA|[LGA -> 2]|
|ATL|[LGA -> 1]|
|BOS|[LGA -> 1]|
|EWR|[LGA -> 2]|
+---+----------+
Compute	shortest	paths		from	each	Airport	to	LGA
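The hop counts in the table above can be reproduced in miniature with a breadth-first search in plain Scala. This is a sketch of the idea rather than the GraphFrames `shortestPaths` implementation; the object name, the `hopsTo` helper and the choice of the three-route example graph are assumptions.

```scala
import scala.annotation.tailrec

object ShortestPathsSketch {
  // Routes from the simple flight example: (src, dst).
  val routes: Seq[(String, String)] =
    Seq("SFO" -> "ORD", "ORD" -> "DFW", "DFW" -> "SFO")

  // Hop count from every vertex to the landmark, following edge direction.
  // Implemented as a breadth-first search that walks the edges backwards from the landmark.
  def hopsTo(landmark: String, edges: Seq[(String, String)]): Map[String, Int] = {
    val incoming: Map[String, Seq[String]] =
      edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1) }

    @tailrec
    def loop(frontier: Set[String], dist: Map[String, Int], hops: Int): Map[String, Int] =
      if (frontier.isEmpty) dist
      else {
        // Vertices with an edge into the frontier that have not been reached yet.
        val next = frontier
          .flatMap(v => incoming.getOrElse(v, Seq.empty))
          .filterNot(dist.contains)
        loop(next, dist ++ next.map(_ -> (hops + 1)), hops + 1)
      }

    loop(Set(landmark), Map(landmark -> 0), 0)
  }
}
```

With DFW as the landmark, ORD is one hop away (ORD to DFW) and SFO is two hops away (SFO to ORD to DFW), mirroring the `[LGA -> n]` entries in the output above.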
graph.bfs.fromExpr("id = 'LAX'")
.toExpr("id = 'LGA'").maxPathLength(1).run().show()
+----+-------+-----+---+
|City|Country|State| id|
+----+-------+-----+---+
+----+-------+-----+---+
Breadth	First	Search	for	Direct	Flights	between	LAX	and	LGA
graph.bfs.fromExpr("id = 'LAX'")
.toExpr("id = 'LGA'").maxPathLength(2).run().show(5)
+--------------------+--------------------+--------------------+--------------------+--------------------+
| from| e0| v1| e1| to|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
Breadth	First	Search	for	Flights	between	LAX	and	LGA
graph.find("(a)-[ab]->(b); (b)-[bc]->(c)")
.filter("a.id = 'LAX'")
.filter("c.id = 'LGA'").show(4)
Motif	Search	for	Flights	between	LAX	and	LGA		
Motif find: search for a pattern
DataFrames: refine the result
val paths = graph.bfs.fromExpr("id = 'LAX'").toExpr("id = 'LGA'")
.maxPathLength(3).edgeFilter("carrier = 'AA'").run()
paths.filter("e0.crsarrtime<e1.crsdeptime-60 and e0.fldate=e1.fldate")
.select("e0.id","e1.id").show(5)
+--------------------------+--------------------------+
|id |id |
+--------------------------+--------------------------+
|LAX_BOS_2018-02-03_AA_1098|BOS_LGA_2018-02-03_AA_2126|
|LAX_BOS_2018-02-03_AA_1379|BOS_LGA_2018-02-03_AA_2126|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1740|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1910|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1954|
+--------------------------+--------------------------+
Breadth	First	Search	for	Flights	between	LAX	and	LGA	with	AA	
BFS: find the paths
DataFrames: refine the result
Resources
Link to code for this webinar is in the appendix of this book:
https://mapr.com/ebook/getting-started-with-apache-spark-v2/
New	Spark	Ebook
•  MapR Free ODT http://learn.mapr.com/
To	Learn	More:	New	Spark	2.0	training
https://mapr.com/blog/
MapR	Blog
MapR	Data	Platform	
Link to Code for this webinar is in
appendix of the book.
https://mapr.com/ebook/getting-
started-with-apache-spark-v2/

More Related Content

PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
PDF
Web-Scale Graph Analytics with Apache® Spark™
PPTX
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
PPT
MySql slides (ppt)
PDF
Pregel: A System For Large Scale Graph Processing
PPTX
Mongo Nosql CRUD Operations
PDF
Introduction to MongoDB
PPTX
Introduction to Apache Spark
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
MySql slides (ppt)
Pregel: A System For Large Scale Graph Processing
Mongo Nosql CRUD Operations
Introduction to MongoDB
Introduction to Apache Spark

What's hot (20)

PDF
Spark DataFrames and ML Pipelines
PDF
Common Strategies for Improving Performance on Your Delta Lakehouse
PDF
Introduction to Cassandra
PDF
Introduction to Apache Cassandra
PDF
Prophet at Scale: Using Prophet at scale to tune and forecast time series at ...
PPTX
An Overview of Apache Cassandra
PDF
Introducing DataFrames in Spark for Large Scale Data Science
PDF
Adding measures to Calcite SQL
PDF
Apache Kafka Architecture & Fundamentals Explained
PPTX
DPLYR package in R
PDF
Massive Data Processing in Adobe Using Delta Lake
PDF
Migration to Azure Database for MySQL
PDF
Intro to Neo4j and Graph Databases
PPTX
Databricks Platform.pptx
PDF
Introduction to Data Stream Processing
PPTX
Relational algebra in DBMS
PPTX
Apache airflow
PPTX
Basic sql Commands
PDF
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
PDF
Spark SQL
Spark DataFrames and ML Pipelines
Common Strategies for Improving Performance on Your Delta Lakehouse
Introduction to Cassandra
Introduction to Apache Cassandra
Prophet at Scale: Using Prophet at scale to tune and forecast time series at ...
An Overview of Apache Cassandra
Introducing DataFrames in Spark for Large Scale Data Science
Adding measures to Calcite SQL
Apache Kafka Architecture & Fundamentals Explained
DPLYR package in R
Massive Data Processing in Adobe Using Delta Lake
Migration to Azure Database for MySQL
Intro to Neo4j and Graph Databases
Databricks Platform.pptx
Introduction to Data Stream Processing
Relational algebra in DBMS
Apache airflow
Basic sql Commands
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Spark SQL
Ad

Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB

  • 2. Agenda (© 2018 MapR Technologies, Inc. // MapR Confidential)
      • Introduction to graphs
      • Introduction to GraphFrames with a simple flight dataset
      • Use GraphFrames with the flight dataset for 2018
  • 4. What is a Graph?
      • Graph: models relations between objects
      • Graph: vertices connected by edges
      • Vertices: the objects
      • Edges: the relationships between vertices
  • 5. Undirected Graphs vs Directed Graphs
      Undirected graph: edges have no direction, so a relationship holds both ways.
      Example: Facebook friends
      • Ted is a friend of Carol
      • Carol is a friend of Ted
  • 6. Undirected Graphs vs Directed Graphs
      Directed graph: edges have a direction.
      Example: Twitter followers
      • Carol follows Oprah
      • Oprah does not follow Carol
  • 7. Property Graph
      • Edges and vertices have properties
      • A vertex can have multiple directed edges in parallel
      • Allows multiple relationships between the same vertices
      Spark GraphX supports a distributed property graph.
      (Figure: airport vertices with properties City, State; flight edges with properties Flight number, Distance, Delay.)
  • 8. What is GraphX?
      GraphX is one of four libraries built on Spark Core:
      • Spark SQL: structured data, querying with SQL/HQL, DataFrames
      • Spark Streaming: processing of live streams, micro-batching
      • MLlib: machine learning, multiple types of ML algorithms
      • GraphX: graph processing, graph-parallel computations
      • Spark Core: task scheduling, memory management, fault recovery, interacting with storage systems
  • 10. Graph Algorithms: PageRank
      • Web sites: vertices = web pages, edges = links between pages
      • PageRank importance is computed iteratively from the number of links to a page and the rank of the pages that link to it
      • Twitter example: who has the most Twitter followers
  • 11. Graph Algorithms: PageRank (Visualize PageRank)
      1. Each page sends a message with its rank to its neighbors
      (Figure: five vertices, each starting with rank 0.20; the message function is sent from each vertex.)
  • 12. Graph Algorithms: PageRank (Visualize PageRank)
      1. Each page sends a message with its rank to its neighbors
      2. Messages are aggregated at each destination vertex
      3. The sum of the messages becomes the new PageRank of the vertex
      4. Repeat
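The four steps above can be sketched with plain Scala collections. This is only an illustration of the message-passing idea, not the GraphX or GraphFrames implementation: the five-page link list, the object name `PageRankSketch`, and the choice to split a page's rank evenly across its outgoing links are assumptions. The reset probability of 0.15 mirrors the value used later in the deck.

```scala
object PageRankSketch {
  // Hypothetical five-page web: directed links as (source, destination) pairs.
  // Every page has at least one outgoing link, so no rank is lost.
  val links: Seq[(String, String)] = Seq(
    "A" -> "B", "B" -> "C", "C" -> "A", "C" -> "D", "D" -> "E", "E" -> "A"
  )

  /** One round: send rank / outDegree along every link, then sum at each destination. */
  def step(ranks: Map[String, Double], resetProb: Double): Map[String, Double] = {
    val outDegree: Map[String, Int] =
      links.groupBy(_._1).map { case (src, es) => src -> es.size }
    // Step 1: each page sends a message carrying its share of rank to each neighbor.
    val messages: Seq[(String, Double)] =
      links.map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
    // Steps 2-3: aggregate the messages per destination; the sum drives the new rank.
    val summed: Map[String, Double] =
      messages.groupBy(_._1).map { case (dst, ms) => dst -> ms.map(_._2).sum }
    val n = ranks.size
    ranks.map { case (page, _) =>
      page -> (resetProb / n + (1 - resetProb) * summed.getOrElse(page, 0.0))
    }
  }

  /** Step 4: repeat for a fixed number of iterations, starting from equal ranks. */
  def run(iterations: Int, resetProb: Double = 0.15): Map[String, Double] = {
    val pages = links.flatMap { case (s, d) => Seq(s, d) }.distinct
    val initial = pages.map(_ -> 1.0 / pages.size).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => step(ranks, resetProb))
  }

  def main(args: Array[String]): Unit =
    run(10).toSeq.sortBy(-_._2).foreach { case (p, r) => println(f"$p%s $r%.4f") }
}
```

In this sketch the ranks always sum to 1, matching the 0.20 starting values in the figure for five vertices.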
  • 13. Graph Algorithms
      Many graph algorithms aggregate properties of neighbors:
      • PageRank
      • Connected Components
      • Shortest Path
      (Figures: connected components; shortest path from A to F. Reference: https://en.wikipedia.org/)
  • 14. Graph Motif Queries
      Graph motif: a recurrent pattern in a graph.
      Graph motif query: search a graph for occurrences of a given pattern.
      Twitter example: who should we recommend for Carol to follow?
      • Carol follows Oprah
      • Oprah follows Reese Witherspoon
      • Recommend that Carol follow Reese
  • 15. Graph Motif Queries
      Graph motif: a recurrent pattern in a graph.
      Graph motif query: search a graph for occurrences of a given pattern.
      Twitter example: whom to recommend following? Search for the pattern:
      • A follows B
      • B follows C
      • A does not follow C
      Then recommend C to A.
  • 16. Graph Query: Motif Find (search for a structural pattern)
      Twitter: A follows B; B follows C; A doesn't follow C
      graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
      Legend:
      • ( ) is a vertex, [ ] is an edge
      • (a)-[]->(b): a follows b
      • (b)-[]->(c): b follows c
      • !(a)-[]->(c): a doesn't follow c
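To make the pattern concrete, here is a small sketch of the same search over an in-memory set of edges. It does not use GraphFrames; the `follows` data, the object name `MotifSketch`, and the function name `recommendations` are made up for illustration.

```scala
object MotifSketch {
  // Hypothetical "follows" edges: (follower, followed).
  val follows: Set[(String, String)] = Set(
    "Carol" -> "Oprah",
    "Oprah" -> "Reese",
    "Ted" -> "Carol",
    "Ted" -> "Oprah"
  )

  /** All (a, b, c) with edges a->b and b->c but no edge a->c. */
  def recommendations(edges: Set[(String, String)]): Set[(String, String, String)] =
    for {
      (a, b) <- edges
      (b2, c) <- edges
      if b == b2 && !edges.contains(a -> c)
    } yield (a, b, c)

  def main(args: Array[String]): Unit =
    recommendations(follows).foreach { case (a, b, c) =>
      println(s"recommend $c to $a (via $b)")
    }
}
```

Ted already follows Oprah, so the path Ted -> Carol -> Oprah is excluded by the negated term, which is the role of `!(a)-[]->(c)` in the motif string.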
  • 17. Separate Systems (image reference: Spark Summit)
  • 18. GraphFrames: Graph Algorithms + Graph Queries (image reference: Spark Summit)
  • 20. Real World Graphs: Twitter
      Morally outraged tweets are retweeted within a political sphere but rarely outside it. (Reference: National Academy of Sciences)
  • 21. Graph: Recommendation Engines
      • Vertices = users, products
      • Edges = ratings or purchases
      • Calculate how similar users rated similar products
  • 22. Real World Graphs: Fraud (healthcare fraud)
      • Vertices = doctors, patients, prescriptions
      • Edges = prescribed
      • Calculate narcotic abuse, patient similarity, over-prescribing
  • 23. Real World Graphs: Fraud (credit card application fraud)
      • Vertices = credit card applicant, phone, email, address, SSN
      • Edges = identifier
      • Detect people sharing identifiers such as a telephone number
      (Figure: applicants linked by a shared phone number. Image reference: Capital One at Spark Summit)
  • 25. Simple Flight Example with GraphFrames
      Originating Airport | Destination Airport | Distance   | Delay
      SFO                 | ORD                 | 1800 miles | 40
      ORD                 | DFW                 | 800 miles  | 0
      DFW                 | SFO                 | 1400 miles | 10
  • 26. Vertex Table
  • 27. Edges Table
  • 28. Create a Vertices DataFrame
      // spark.implicits._ is already in scope in spark-shell
      case class Airport(id: String, city: String)
      val airports = Array(
        Airport("SFO", "San Francisco"),
        Airport("ORD", "Chicago"),
        Airport("DFW", "Dallas Fort Worth"))
      val vertices = spark.createDataset(airports).toDF
      vertices.show
      +---+-----------------+
      | id|             city|
      +---+-----------------+
      |SFO|    San Francisco|
      |ORD|          Chicago|
      |DFW|Dallas Fort Worth|
      +---+-----------------+
  • 29. Create an Edges DataFrame
      case class Flight(id: String, src: String, dst: String, dist: Double, delay: Double)
      val flights = Array(
        Flight("SFO_ORD_2017-01-01_AA", "SFO", "ORD", 1800, 40),
        Flight("ORD_DFW_2017-01-01_UA", "ORD", "DFW", 800, 0),
        Flight("DFW_SFO_2017-01-01_DL", "DFW", "SFO", 1400, 10))
      val edges = spark.createDataset(flights).toDF
      edges.show
      +--------------------+---+---+------+-----+
      |                  id|src|dst|  dist|delay|
      +--------------------+---+---+------+-----+
      |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
      |ORD_DFW_2017-01-0...|ORD|DFW| 800.0|  0.0|
      |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
      +--------------------+---+---+------+-----+
  • 30. Create the GraphFrame
      import org.graphframes.GraphFrame
      // vertices needs an "id" column; edges needs "src" and "dst" columns
      val graph = GraphFrame(vertices, edges)
      graph.vertices.show
      +---+-----------------+
      | id|             city|
      +---+-----------------+
      |SFO|    San Francisco|
      |ORD|          Chicago|
      |DFW|Dallas Fort Worth|
      +---+-----------------+
  • 31. GraphFrame Edges
      graph.edges.show
      +--------------------+---+---+------+-----+
      |                  id|src|dst|  dist|delay|
      +--------------------+---+---+------+-----+
      |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
      |ORD_DFW_2017-01-0...|ORD|DFW| 800.0|  0.0|
      |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
      +--------------------+---+---+------+-----+
  • 32. Graph Operators
      To answer questions such as:
      • How many airports are there?
      • How many flight routes are there?
      • What are the longest distance routes?
      • Which airport has the most incoming flights?
      • What are the top 10 flights?
  • 33. Query the GraphFrame
      // How many airports?
      graph.vertices.count   // result: 3
      // How many flights?
      graph.edges.count      // result: 3
  • 34. Query the GraphFrame
      // Which flight routes are longer than 800 miles?
      graph.edges.filter("dist > 800").show
      +--------------------+---+---+------+-----+
      |                  id|src|dst|  dist|delay|
      +--------------------+---+---+------+-----+
      |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
      |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
      +--------------------+---+---+------+-----+
  • 36. How a Spark Application Runs on a Cluster
  • 37. Spark Distributed Datasets
      • A Dataset is a collection of typed objects: Dataset[T] (can use SQL and functions)
      • A DataFrame is a Dataset of Row objects: Dataset[Row] (can use SQL)
      • Partitioned across a cluster
      • Operated on in parallel
      • Can be cached
  • 38. Spark SQL Querying MapR-DB JSON
      • Spark SQL queries and updates to MapR-DB
      • With projection and filter pushdown, custom partitioning, and data locality
  • 39. Designed for Partitioning and Scaling
      Data is automatically partitioned and sorted by the id row key.
  • 40. Spark MapR-DB Connector
  • 41. Flight Dataset
      { "id": "ATL_LGA_2017-01-01_AA_1678", "dofW": 7, "carrier": "AA", "src": "ATL", "dst": "LGA", "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0, "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0 }
      The table is automatically partitioned and sorted by the id row key.
  • 42. MapR-DB JSON Document Store
      Data is automatically partitioned and sorted by the id row key.
      { "id": "ATL_LGA_2017-01-01_AA_1678", "dofW": 7, "carrier": "AA", "src": "ATL", "dst": "LGA", "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0, "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0 }
  • 43. Row key: starts with src_dst, for example ATL_LGA_2017-01-01_AA_1678
      • Data is automatically partitioned by key range and sorted by row key
      • The table is therefore partitioned by the src and dst vertices
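A short sketch of why this key layout helps: when keys are sorted, all flights for one route sit next to each other. The code uses plain Scala strings rather than MapR-DB, and the `FlightKey` class, the `routeRange` helper, and the sample flights are hypothetical; only the key layout is taken from the example key on the slide.

```scala
object RowKeySketch {
  // Hypothetical record holding only the fields that make up the key.
  final case class FlightKey(src: String, dst: String, date: String,
                             carrier: String, flightNo: Int) {
    /** Composite key in the layout of the slide's example: src_dst_date_carrier_number. */
    def rowKey: String = s"${src}_${dst}_${date}_${carrier}_$flightNo"
  }

  /** In a sorted key list, the keys of one route form a single contiguous range. */
  def routeRange(sortedKeys: Seq[String], src: String, dst: String): Seq[String] = {
    val prefix = s"${src}_${dst}_"
    sortedKeys.dropWhile(!_.startsWith(prefix)).takeWhile(_.startsWith(prefix))
  }

  // Made-up flights, sorted by row key the way the table would store them.
  val sampleKeys: Seq[String] = Seq(
    FlightKey("ATL", "LGA", "2017-01-02", "DL", 2005),
    FlightKey("BOS", "SFO", "2017-01-01", "UA", 500),
    FlightKey("ATL", "LGA", "2017-01-01", "AA", 1678),
    FlightKey("ATL", "BOS", "2017-01-01", "AA", 1100)
  ).map(_.rowKey).sorted

  def main(args: Array[String]): Unit = {
    sampleKeys.foreach(println)
    println(routeRange(sampleKeys, "ATL", "LGA"))
  }
}
```

Because the date follows the route in the key, flights within a route are further ordered by date.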
  • 44. Airports: SFO, DEN, IAH, ATL, ORD, BOS, LGA, EWR, MIA, SEA, LAX, DFW
  • 45. Load the data into a Dataset: Define the Schema
  • 46. Read Dataset from MapR-DB
      val tableName = "/user/mapr/flighttable"
      val df = spark.sparkSession
        .loadFromMapRDB[Flight](tableName, schema)
      (Figure: the driver sends tasks to the workers; each worker processes and caches its share of the data.)
  • 47. Show the first rows of the DataFrame
      df.show(5)
      Data is automatically partitioned and sorted by row key, which starts with src_dst, for example ATL_BOS_2018-01-01_AA_1678.
  • 48. Originating airports with the highest number of departure delays
      df.filter($"depdelay" > 40).groupBy("src")
        .count().orderBy(desc("count")).show(5)
      +---+-----+
      |src|count|
      +---+-----+
      |ORD| 4033|
      |ATL| 3106|
      |DFW| 2782|
      |EWR| 2328|
      |DEN| 2304|
      +---+-----+
  • 49. MapR-DB Projection and Filter Pushdown
      df.filter($"depdelay" > 40).groupBy("src")
        .count.orderBy(desc("count")).explain
      == Physical Plan ==
      *(3) Sort [count#549L DESC NULLS LAST], true, 0
      +- Exchange rangepartitioning(count#549L DESC NULLS LAST, 200)
         +- *(2) HashAggregate(keys=[src#5], functions=[count(1)])
            +- Exchange hashpartitioning(src#5, 200)
               +- *(1) HashAggregate(keys=[src#5], functions=[partial_count(1)])
                  +- *(1) Project [src#5]
                     +- *(1) Filter (isnotnull(depdelay#9) && (depdelay#9 > 40.0))
                        +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,depdelay#9] PushedFilters: [IsNotNull(depdelay), GreaterThan(depdelay, 40.0)], ReadSchema: struct<src:string,depdelay:double>
      The projection and the filter are pushed into MapR-DB.
  • 50. Spark MapR-DB Projection and Filter Pushdown
      Projection and filter pushdown reduce the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data. The data is selected and filtered in MapR-DB.
  • 51. Register the DataFrame as a Temporary View
      df.cache
      df.count()   // Long = 282628
      df.createOrReplaceTempView("flights")
  • 52. Average Departure Delay by Carrier
      %sql select carrier, avg(depdelay) from flights group by carrier
  • 53. Count of Departure Delays by Origin
      %sql select src, count(depdelay) from flights where depdelay > 40 group by src
  • 54. Count of Departure Delays by Origin, Destination
      %sql select src, dst, count(depdelay) from flights where depdelay > 40 group by src, dst
  • 56. GraphFrame and DataFrame
      To answer questions such as:
      • How many flight routes are there?
      • What are the longest distance routes?
      • Which airport has the most incoming flights?
      • What are the top 10 flight routes?
  • 57. Read Vertices DataFrame from a JSON File
      val airports = spark.read.json(file)
      airports.show
      +-------------+-------+-----+---+
      |         City|Country|State| id|
      +-------------+-------+-----+---+
      |      Chicago|    USA|   IL|ORD|
      |     New York|    USA|   NY|JFK|
      |     New York|    USA|   NY|LGA|
      |       Boston|    USA|   MA|BOS|
      |      Houston|    USA|   TX|IAH|
      |       Newark|    USA|   NJ|EWR|
      |       Denver|    USA|   CO|DEN|
      |        Miami|    USA|   FL|MIA|
      |San Francisco|    USA|   CA|SFO|
      |      Atlanta|    USA|   GA|ATL|
      |       Dallas|    USA|   TX|DFW|
      |    Charlotte|    USA|   NC|CLT|
      |  Los Angeles|    USA|   CA|LAX|
      |      Seattle|    USA|   WA|SEA|
      +-------------+-------+-----+---+
  • 58. Create the GraphFrame
      val graph = GraphFrame(airports, df)
      // graph.edges is a DataFrame
      graph.edges.show
  • 59. GraphFrame API
      Category         | Methods
      Graph Topology   | vertices, edges, triplets
      Graph Structure  | inDegrees, outDegrees, degrees
      Graph Algorithms | pageRank, bfs, aggregateMessages, shortestPaths, connectedComponents, triangleCount
      Graph Queries    | motif find
  • 60. DataFrame Queries
      Operation                       | Description
      select(col)                     | Selects a set of columns
      sort(sortcol)                   | Returns a new DataFrame sorted by the specified column
      filter(expr); where(condition)  | Filters based on the SQL expression or condition
      groupBy(cols: Columns)          | Groups the DataFrame using the specified columns
      join(DataFrame, joinExpr)       | Joins with another DataFrame using the given join expression
      count                           | Count of rows
      avg, count, min, max, sum (col) | Average, count, min, max, sum of the values in a group
  • 61. Graph Vertices and Edges are DataFrames
      graph.vertices.filter("State='TX'").show
      +-------+-------+-----+---+
      |   City|Country|State| id|
      +-------+-------+-----+---+
      |Houston|    USA|   TX|IAH|
      | Dallas|    USA|   TX|DFW|
      +-------+-------+-----+---+
  • 62. GraphFrame DataFrame Queries
      // How many airports?
      graph.vertices.count   // result: 13
      // How many flights?
      graph.edges.count      // result: 282628
  • 63. What are the 4 Longest Distance Flights?
      // Show the longest distance flight routes
      graph.edges.groupBy("src", "dst")
        .max("dist").sort(desc("max(dist)")).show(4)
      +---+---+---------+
      |src|dst|max(dist)|
      +---+---+---------+
      |MIA|SEA|   2724.0|
      |SEA|MIA|   2724.0|
      |BOS|SFO|   2704.0|
      |SFO|BOS|   2704.0|
      +---+---+---------+
  • 64. What is the average delay for delayed flights from Atlanta?
      graph.edges.filter("src = 'ATL' and depdelay > 1")
        .groupBy("src", "dst").avg("depdelay")
        .sort(desc("avg(depdelay)")).show
      +---+---+------------------+
      |src|dst|     avg(depdelay)|
      +---+---+------------------+
      |ATL|EWR|  58.1085801063022|
      |ATL|ORD| 46.42393736017897|
      |ATL|DFW|39.454460966542754|
      |ATL|LGA| 39.25498489425982|
      |ATL|CLT| 37.56777108433735|
      |ATL|SFO| 36.83008356545961|
      +---+---+------------------+
  • 65. MapR-DB Projection and Filter Pushdown
      graph.edges.filter("src = 'ATL' and depdelay > 1")
        .groupBy("src", "dst").avg("depdelay")
        .sort(desc("avg(depdelay)")).explain
      == Physical Plan ==
      *(3) Sort [avg(depdelay)#273 DESC NULLS LAST], true, 0
      +- Exchange rangepartitioning(avg(depdelay)#273 DESC NULLS LAST, 200)
         +- *(2) HashAggregate(keys=[src#5, dst#6], functions=[avg(depdelay#9)])
            +- Exchange hashpartitioning(src#5, dst#6, 200)
               +- *(1) HashAggregate(keys=[src#5, dst#6], functions=[partial_avg(depdelay#9)])
                  +- *(1) Filter (((isnotnull(src#5) && isnotnull(depdelay#9)) && (src#5 = ATL)) && (depdelay#9 > 1.0))
                     +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,dst#6,depdelay#9] PushedFilters: [IsNotNull(src), IsNotNull(depdelay), EqualTo(src,ATL), GreaterThan(depdelay,1.0)], ReadSchema: struct<src:string,dst:string,depdelay:double>
  • 66. What is the Average Delay for delayed flights from Atlanta by Hour?
      z.show(
        graph.edges
          .filter("src = 'ATL' and depdelay > 1")
          .groupBy("crsdephour")
          .avg("depdelay")
      )
  • 68. Which Airports have the most incoming and outgoing flights?
      // What are the highest-degree vertices?
      z.show(graph.degrees.orderBy(desc("degree")))
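As a rough illustration of what the degree of a vertex counts, the sketch below derives in-degree, out-degree, and total degree from a plain edge list. It is not the GraphFrames implementation; the five flights and the object name `DegreeSketch` are invented for the example.

```scala
object DegreeSketch {
  // Hypothetical edge list: one (src, dst) pair per flight.
  val flights: Seq[(String, String)] = Seq(
    "SFO" -> "ORD", "ORD" -> "DFW", "DFW" -> "SFO", "SFO" -> "DFW", "ORD" -> "SFO"
  )

  /** Number of outgoing edges per vertex. */
  def outDegrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  /** Number of incoming edges per vertex. */
  def inDegrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.groupBy(_._2).map { case (v, es) => v -> es.size }

  /** degree = inDegree + outDegree, busiest vertex first (ties broken by name). */
  def degrees(edges: Seq[(String, String)]): Seq[(String, Int)] = {
    val in = inDegrees(edges)
    val out = outDegrees(edges)
    (in.keySet ++ out.keySet).toSeq
      .map(v => v -> (in.getOrElse(v, 0) + out.getOrElse(v, 0)))
      .sortBy { case (v, d) => (-d, v) }
  }

  def main(args: Array[String]): Unit =
    degrees(flights).foreach { case (v, d) => println(s"$v $d") }
}
```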
  • 70. Use PageRank to find the most important airports
      val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
      ranks.vertices.orderBy($"pagerank".desc).show(5)
      +-------------+-------+-----+---+-------------------+
      |         City|Country|State| id|           pagerank|
      +-------------+-------+-----+---+-------------------+
      |      Chicago|    USA|   IL|ORD| 1.5129929839358685|
      |      Atlanta|    USA|   GA|ATL| 1.4255481544216664|
      |  Los Angeles|    USA|   CA|LAX| 1.2787001001758738|
      |       Dallas|    USA|   TX|DFW| 1.1999252171688064|
      |       Denver|    USA|   CO|DEN| 1.1275194324360767|
      +-------------+-------+-----+---+-------------------+
  • 72. Aggregate Messages to calculate the average delay
      import org.graphframes.lib.AggregateMessages
      import org.apache.spark.sql.functions.avg

      val AM = AggregateMessages
      // message for each edge: the flight's departure delay
      val msgToSrc = AM.edge("depdelay")
      // send each message to the edge's source airport and average per airport
      val agg = graph.aggregateMessages
        .sendToSrc(msgToSrc)
        .agg(avg(AM.msg).as("avgdelay"))
      agg.show()
      +---+------------------+
      | id|          avgdelay|
      +---+------------------+
      |EWR|17.818079459546404|
      |MIA|17.768691978431264|
      |ORD|  16.5199551010227|
      +---+------------------+
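The same idea can be shown without Spark: each edge sends its delay to its source vertex, and each vertex averages the messages it receives. The following is a sketch over in-memory collections, with a hypothetical `Flight` class and made-up delays; it illustrates the pattern only and is not how GraphFrames executes it.

```scala
object AggregateSketch {
  // Hypothetical edge record with only the fields this example needs.
  final case class Flight(src: String, dst: String, depdelay: Double)

  val flights: Seq[Flight] = Seq(
    Flight("EWR", "ORD", 30.0),
    Flight("EWR", "MIA", 10.0),
    Flight("ORD", "EWR", 5.0)
  )

  /** Send each edge's depdelay to its source vertex, then average per vertex. */
  def avgDelayBySource(edges: Seq[Flight]): Map[String, Double] = {
    // "sendToSrc": one (vertex id, message) pair per edge
    val messages: Seq[(String, Double)] = edges.map(e => e.src -> e.depdelay)
    // "agg(avg(msg))": combine the messages that arrive at each vertex
    messages.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).sum / ms.size }
  }

  def main(args: Array[String]): Unit =
    avgDelayBySource(flights).foreach { case (id, d) => println(s"$id $d") }
}
```

Sending to the destination instead of the source would give the average departure delay of flights arriving at each airport.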
  • 73. What are the most Frequent Flight Routes?
      // count of flights per route
      val flightroutecount = graph.edges
        .groupBy("src", "dst")
        .count().orderBy(desc("count"))
      flightroutecount.show(5)
      +---+---+-----+
      |src|dst|count|
      +---+---+-----+
      |LGA|ORD| 4442|
      |ORD|LGA| 4426|
      |LAX|SFO| 4406|
      |SFO|LAX| 4354|
      |ATL|LGA| 3884|
      +---+---+-----+
      // how many routes?
      flightroutecount.count   // Long = 148
  • 74. What are the most Frequent Flight Routes? (highest count of flights)
      z.show(flightroutecount)
  • 76. Triplets = 2 Vertices and 1 Connecting Edge
      graph.triplets.show(3)
      +--------------------+--------------------+--------------------+
      |                 src|                edge|                 dst|
      +--------------------+--------------------+--------------------+
      |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
      |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
      |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
      +--------------------+--------------------+--------------------+
      The src, edge, and dst columns are DataFrame structs.
  • 77. Triplets = 2 Vertices and 1 Connecting Edge
      // DataFrame operations refine the result
      graph.triplets
        .filter("src.State='TX'")
        .show
      +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
      |src                   |edge                                                                                                  |dst                    |
      +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
      |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1473, 2018-01-01, 1, 1, AA, DFW, ATL, 10, 1026, 26.0, 1327, 21.0, 121.0, 731.0]|[Atlanta, USA, GA, ATL]|
      |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1675, 2018-01-01, 1, 1, AA, DFW, ATL, 13, 1255, 32.0, 1557, 16.0, 122.0, 731.0]|[Atlanta, USA, GA, ATL]|
      |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2408, 2018-01-01, 1, 1, AA, DFW, ATL, 18, 1835, 4.0, 2141, 0.0, 126.0, 731.0]  |[Atlanta, USA, GA, ATL]|
      |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2479, 2018-01-01, 1, 1, AA, DFW, ATL, 9, 855, 0.0, 1200, 0.0, 125.0, 731.0]    |[Atlanta, USA, GA, ATL]|
      |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2497, 2018-01-01, 1, 1, AA, DFW, ATL, 21, 2055, 0.0, 2359, 0.0, 124.0, 731.0]  |[Atlanta, USA, GA, ATL]|
      +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
  • 78. Motif find: search for a pattern
      graph.find("(src)-[edge]->(dst)").show(3)
      +--------------------+--------------------+--------------------+
      |                 src|                edge|                 dst|
      +--------------------+--------------------+--------------------+
      |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
      |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
      |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
      +--------------------+--------------------+--------------------+
  • 79. Next: use flightroutecount with Motif find
      // count of flights per route
      val flightroutecount = graph.edges
        .groupBy("src", "dst")
        .count().orderBy(desc("count"))
      flightroutecount.show(5)
      +---+---+-----+
      |src|dst|count|
      +---+---+-----+
      |LGA|ORD| 4442|
      |ORD|LGA| 4426|
      |LAX|SFO| 4406|
      |SFO|LAX| 4354|
      |ATL|LGA| 3884|
      +---+---+-----+
  • 80. Motif Find Flights with No Direct Connection
      // one edge per route instead of one edge per flight
      val subGraph = GraphFrame(graph.vertices, flightroutecount)
      val res = subGraph
        .find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")   // search for a pattern
        .filter("c.id != a.id")                            // refine the result: drop round trips back to a
      Legend: ( ) is a vertex, [ ] is an edge; (a)-[]->(b) and (b)-[]->(c) must exist, and !(a)-[]->(c) means there is no direct route from a to c.
  • 83. Compute shortest paths from each Airport to LGA
      val results = graph.shortestPaths.landmarks(Seq("LGA")).run()
      results.select("id", "distances").show()
      +---+----------+
      | id| distances|
      +---+----------+
      |IAH|[LGA -> 1]|
      |CLT|[LGA -> 1]|
      |LAX|[LGA -> 2]|
      |DEN|[LGA -> 1]|
      |DFW|[LGA -> 1]|
      |SFO|[LGA -> 2]|
      |LGA|[LGA -> 0]|
      |ORD|[LGA -> 1]|
      |MIA|[LGA -> 1]|
      |SEA|[LGA -> 2]|
      |ATL|[LGA -> 1]|
      |BOS|[LGA -> 1]|
      |EWR|[LGA -> 2]|
      +---+----------+
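The distances in this table are hop counts to the landmark. One way to picture how they arise is a breadth-first walk backwards from the landmark, as in this sketch. It runs on a plain set of edges rather than GraphFrames, and the `routes` data and the names `ShortestPathSketch` and `hopsTo` are assumptions for illustration.

```scala
import scala.annotation.tailrec

object ShortestPathSketch {
  // Hypothetical routes as directed (src, dst) pairs.
  val routes: Set[(String, String)] = Set(
    "SFO" -> "ORD", "ORD" -> "LGA", "LGA" -> "ORD", "ORD" -> "SFO", "SEA" -> "SFO"
  )

  /** Hop count to `landmark` for every vertex that can reach it. */
  def hopsTo(edges: Set[(String, String)], landmark: String): Map[String, Int] = {
    // For each vertex, the vertices that have an edge pointing at it.
    val predecessors: Map[String, Set[String]] =
      edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1) }

    // Walk the edges backwards from the landmark, one BFS level per call.
    @tailrec
    def loop(frontier: Set[String], dist: Map[String, Int], hops: Int): Map[String, Int] = {
      val next =
        frontier.flatMap(v => predecessors.getOrElse(v, Set.empty[String])) -- dist.keySet
      if (next.isEmpty) dist
      else loop(next, dist ++ next.map(_ -> (hops + 1)), hops + 1)
    }

    loop(Set(landmark), Map(landmark -> 0), 0)
  }

  def main(args: Array[String]): Unit =
    hopsTo(routes, "LGA").toSeq.sortBy(_._2).foreach { case (v, d) => println(s"$v -> $d") }
}
```

A vertex with no path to the landmark is simply absent from the result map.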
  • 85. Breadth First Search for Direct Flights between LAX and LGA
      graph.bfs.fromExpr("id = 'LAX'")
        .toExpr("id = 'LGA'").maxPathLength(1).run().show()
      +----+-------+-----+---+
      |City|Country|State| id|
      +----+-------+-----+---+
      +----+-------+-----+---+
      The empty result means there are no direct flights from LAX to LGA in this dataset.
  • 86. Breadth First Search for Flights between LAX and LGA
      graph.bfs.fromExpr("id = 'LAX'")
        .toExpr("id = 'LGA'").maxPathLength(2).run().show(5)
      +--------------------+--------------------+--------------------+--------------------+--------------------+
      |                from|                  e0|                  v1|                  e1|                  to|
      +--------------------+--------------------+--------------------+--------------------+--------------------+
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      +--------------------+--------------------+--------------------+--------------------+--------------------+
  • 87. Motif Search for Flights between LAX and LGA
      graph.find("(a)-[ab]->(b); (b)-[bc]->(c)")   // search for a pattern
        .filter("a.id = 'LAX'")                    // DataFrame filters refine the result
        .filter("c.id = 'LGA'").show(4)
  • 88. Breadth First Search for Flights between LAX and LGA with AA
      // BFS restricted to edges flown by AA
      val paths = graph.bfs.fromExpr("id = 'LAX'").toExpr("id = 'LGA'")
        .maxPathLength(3).edgeFilter("carrier = 'AA'").run()
      // DataFrame filter refines the result: same day, first leg arrives before the second departs
      paths.filter("e0.crsarrtime<e1.crsdeptime-60 and e0.fldate=e1.fldate")
        .select("e0.id", "e1.id").show(5)
      +--------------------------+--------------------------+
      |id                        |id                        |
      +--------------------------+--------------------------+
      |LAX_BOS_2018-02-03_AA_1098|BOS_LGA_2018-02-03_AA_2126|
      |LAX_BOS_2018-02-03_AA_1379|BOS_LGA_2018-02-03_AA_2126|
      |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1740|
      |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1910|
      |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1954|
      +--------------------------+--------------------------+
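The connection filter above can be sketched on in-memory data as a join of two legs. This is an illustration only: the `Leg` class, the sample legs, and the 60-minute minimum are assumptions, and the sketch assumes the scheduled times are HHMM clock values (as the sample record with crsdeptime 1700 and crsdephour 17 suggests), converting them to minutes before comparing.

```scala
object ConnectionSketch {
  // Hypothetical flight leg; depTime and arrTime are HHMM clock values such as 1540.
  final case class Leg(id: String, src: String, dst: String, date: String,
                       depTime: Int, arrTime: Int)

  /** Convert an HHMM clock value such as 1912 into minutes since midnight. */
  def toMinutes(hhmm: Int): Int = (hhmm / 100) * 60 + hhmm % 100

  /** Same-day two-leg itineraries with at least `minLayover` minutes to connect. */
  def connections(legs: Seq[Leg], from: String, to: String,
                  minLayover: Int): Seq[(String, String)] =
    for {
      first <- legs if first.src == from
      second <- legs
      if second.src == first.dst && second.dst == to
      if first.date == second.date
      if toMinutes(second.depTime) - toMinutes(first.arrTime) >= minLayover
    } yield (first.id, second.id)

  // Made-up legs: one LAX to CLT flight and three CLT to LGA candidates.
  val legs: Seq[Leg] = Seq(
    Leg("L1", "LAX", "CLT", "2018-02-14", 800, 1540),
    Leg("L2", "CLT", "LGA", "2018-02-14", 1620, 1810), // 40-minute layover
    Leg("L3", "CLT", "LGA", "2018-02-14", 1700, 1850), // 80-minute layover
    Leg("L4", "CLT", "LGA", "2018-02-15", 1700, 1850)  // next day
  )

  def main(args: Array[String]): Unit =
    connections(legs, "LAX", "LGA", 60).foreach(println)
}
```

If the columns are indeed HHMM values, subtracting 60 directly from them does not always mean 60 minutes: for L1 and L2 the raw comparison 1540 < 1620 - 60 holds even though the layover is only 40 minutes, which is why the sketch converts to minutes first.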
  • 90. New Spark Ebook
      A link to the code for this webinar is in the appendix of this book: https://mapr.com/ebook/getting-started-with-apache-spark-v2/
  • 92. To Learn More: new Spark 2.0 training
      • MapR free on-demand training (ODT): http://learn.mapr.com/
  • 93. MapR Blog: https://mapr.com/blog/
  • 94. MapR Data Platform
      A link to the code for this webinar is in the appendix of the book: https://mapr.com/ebook/getting-started-with-apache-spark-v2/