Traversing our way through
Apache Spark GraphFrames
and
GraphX
Mo Patel
Data Day Texas 2017
A bit about me
• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing them
mopatel
What is this talk about?
• What are Graphs and what are some interesting
things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial
companies fight Synthetic Identity Fraud?
What is a Graph?
Natural Artificial
Wikipedia
Wikipedia
Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/, slide 14
Power of Graphs
• Good: Facebook, Twitter, WhatsApp… most
popular social networks
• Bad: MySpace, Friendster, Orkut… “Nobody
goes there anymore. It's too crowded” – Yogi
Berra
• Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s
Law (2^n)
• Memory Intensive
• Processing Intensive
Graph Databases cost money,
Graph Analytics make money!
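To make the data-growth point concrete, here is a minimal sketch that tabulates the two growth terms named on the slide, Metcalfe's Law (roughly n^2) and Reed's Law (roughly 2^n), for a few network sizes. The object name, function names and sample sizes are illustrative assumptions, not part of the talk.

```scala
// Sketch: compare the two growth terms quoted on the slide for a network of n members.
object NetworkLaws {
  // Metcalfe's Law: value scales with the number of possible pairwise links, ~ n^2.
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's Law: value scales with the number of possible subgroups, ~ 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(10, 20, 40))
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
}
```

Even at n = 40 the exponential term is already in the trillions, which is one way to read the "Memory Intensive" and "Processing Intensive" bullets.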
Graph Databases cost money,
Graph Analytics make money!
• PageRank, EigenCentrality
• Modularity, Clustering Coefficient,
Betweenness, Closeness
• Loopy Belief Propagation, SALSA
Node Score in a Graph
• Use case: Find out how important an entity is
in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
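As a rough illustration of node scoring, the following plain-Scala sketch runs a fixed number of PageRank iterations over an edge list. It is not the Spark implementation: it assumes the un-normalized update rank = resetProb + (1 - resetProb) * (sum of incoming contributions), assumes every edge source appears in the vertex list, and simply drops the rank of vertices that have no out-edges. All names are hypothetical.

```scala
// Sketch: iterative PageRank on a small in-memory graph (no Spark dependency).
object PageRankSketch {
  // Edges are (src, dst) pairs. Returns one score per vertex.
  def pageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)],
      resetProb: Double = 0.15,
      iterations: Int = 20
  ): Map[String, Double] = {
    // Number of out-edges per source vertex.
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    // Every vertex starts with rank 1.0.
    var ranks: Map[String, Double] = vertices.map(_ -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each vertex splits its current rank evenly across its out-edges.
      val incoming: Map[String, Double] = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (dst, contribs) => dst -> contribs.map(_._2).sum }
      ranks = vertices.map { v =>
        v -> (resetProb + (1.0 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

For example, `PageRankSketch.pageRank(Seq("a1", "b2", "c3", "d4"), Seq("b2" -> "a1", "a1" -> "c3", "c3" -> "b2"))` scores the three vertices on the cycle equally, while the isolated vertex "d4" keeps only the reset probability.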
Communities in a Graph
• Use case: Detect similar nodes
– Behavioral Segmentation
– Crime Rings
– Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient,
Betweenness, Closeness
Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf
(Implemented: Spark, iGraph)
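Of the methods listed, the clustering coefficient is the easiest to show in a few lines. The sketch below computes the local clustering coefficient of a vertex in an undirected graph: the fraction of pairs of its neighbors that are themselves connected. It is plain Scala with hypothetical names, intended only to show the definition, not how any of the listed tools implement it.

```scala
// Sketch: local clustering coefficient on an undirected in-memory graph.
object ClusteringSketch {
  // Build an undirected adjacency map from (u, v) pairs, ignoring self-loops.
  def adjacency(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges
      .filter { case (u, v) => u != v }
      .flatMap { case (u, v) => Seq(u -> v, v -> u) }
      .groupBy(_._1)
      .map { case (u, pairs) => u -> pairs.map(_._2).toSet }

  // Fraction of a vertex's neighbor pairs that are directly connected.
  def localClustering(adj: Map[String, Set[String]], v: String): Double = {
    val nbrs = adj.getOrElse(v, Set.empty[String]).toSeq
    val k = nbrs.size
    if (k < 2) 0.0 // fewer than two neighbors: no pairs to check
    else {
      val linked = nbrs.combinations(2).count {
        case Seq(a, b) => adj.getOrElse(a, Set.empty[String]).contains(b)
      }
      linked.toDouble / (k * (k - 1) / 2.0)
    }
  }
}
```

A vertex inside a tightly knit group (for instance a crime ring) scores close to 1.0, while a vertex that only bridges otherwise unrelated neighbors scores close to 0.0.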
Growth in Graph
• Use case: Predict where the graph will grow or
suggest new edges
– Event Prediction
– Product Recommendation
• Methods: Loopy Belief Propagation, Belief
Networks, SALSA
Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, GitHub PageRanking)
GraphX
• Apache Spark Library for conducting Graph
Analytics
• Graph Operations: numEdges, numVertices,
degrees, collectNeighbors
• Graph Analytics:
– PageRank
– Connected Components
– Triangle Count
http://spark.apache.org/graphx/
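To give a feel for what the Connected Components analytic returns, here is a plain-Scala sketch with no Spark dependency. It labels every vertex with the smallest vertex id in its component, which is also the labeling that GraphX's `connectedComponents` documents. The object and function names are hypothetical, and the loop is a simple repeat-until-stable pass over the edge list rather than GraphX's distributed implementation.

```scala
// Sketch: connected components by propagating the minimum vertex id.
object ComponentsSketch {
  // Edges are treated as undirected. Returns vertex id -> component label.
  def connectedComponents(
      vertices: Seq[Long],
      edges: Seq[(Long, Long)]
  ): Map[Long, Long] = {
    // Start with every vertex in its own component.
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((u, v) <- edges) {
        // Both endpoints adopt the smaller of their two labels.
        val low = math.min(labels(u), labels(v))
        if (labels(u) != low || labels(v) != low) {
          labels = labels.updated(u, low).updated(v, low)
          changed = true
        }
      }
    }
    labels
  }
}
```

Grouping the result by label gives the components, so vertices that share an address, phone number or device end up under one label.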
Property Graph
GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL,
Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with
Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
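GraphFrame
Vertices DataFrame
import org.graphframes.GraphFrame

// Vertices need an "id" column; the other columns are attributes.
val vertices =
  sqlContext.createDataFrame(
    List(
      ("a1", "Wine", "Beverage"),
      ("b2", "Beer", "Beverage"),
      ("c3", "Pretzel", "Snack"),
      ("d4", "Cheese", "Snack")
    )).toDF("id", "name", "type")
Edges DataFrame
// GraphFrame expects the edge endpoints in columns named "src" and "dst".
val edges =
  sqlContext.createDataFrame(
    List(
      ("a1", "d4", 15455),
      ("b2", "c3", 4849),
      ("a1", "c3", 40),
      ("b2", "d4", 134)
    )).toDF("src", "dst", "count")
GraphFrame
val productsGraphFrame =
  GraphFrame(vertices, edges)
// Filter vertices with a SQL expression; the string literal needs quotes.
productsGraphFrame.vertices.filter("type = 'Snack'")
// Edges are exposed as a DataFrame, so count its rows.
productsGraphFrame.edges.count()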
What is Synthetic Identity Fraud?
http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud
Why has Synthetic Identity Fraud
emerged as a big problem?
Verafin
How are Synthetic IDs created?
Verafin
Verafin
How are Financial Companies exploited?
Verafin
What is the impact of Synthetic Identity Fraud?
Verafin
Verafin
How can Graph Analytics help
solve the Synthetic Identity problem?
Customer Address DataFrame (vertices)
// Known addresses on file for one customer; "id" identifies each address vertex.
val customerAddresses =
  sqlContext.createDataFrame(
    List(
      ("a1", "123 Main Street", "123abc456efg"),
      ("b2", "345 High Street", "123abc456efg"),
      ("c3", "789 Park Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")
Add Fake Address
val fakeAddress = sqlContext.createDataFrame(
  List(
    ("d4", "999 Ocean Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")
// Append the candidate address to the known addresses.
val tempCustomerAddresses =
  customerAddresses.union(fakeAddress)
Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
How can Graph Analytics help
solve the Synthetic Identity problem?
Master Address Connection Edges
DataFrame
val masterAddressConnections = sqlContext.createDataFrame(
  List(
    ("b2", "a1"),
    ("e5", "c3"),
    ("c3", "b2"),
    ("a1", "c3"),
    ("e5", "d4")
    …
  )).toDF("src", "dst")
// The edge columns are "src" and "dst" and hold address ids,
// so both joins match against customerAddresses("id").
// Connections that end at a known customer address:
val toEdgeMatches = masterAddressConnections.join(customerAddresses,
  masterAddressConnections("dst") ===
    customerAddresses("id")).select("src", "dst")
// Connections that start at a known customer address:
val fromEdgeMatches =
  masterAddressConnections.join(customerAddresses,
    masterAddressConnections("src") ===
      customerAddresses("id")).select("src", "dst")
val checkEdges = fromEdgeMatches.union(toEdgeMatches)
Detection GraphFrame
PageRank
val detectionGraphFrame =
  GraphFrame(tempCustomerAddresses,
    checkEdges)
// PageRank: iterate until ranks change by less than the tolerance
val resultRanks =
  detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
// Personalized PageRank, measured from source vertex "d4"
val d4Ranks =
  detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
resultRanks.vertices.select("id", "pagerank").show()
Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
How do we decide whether this address is
fraudulent or not?
PageRank
id   pagerank
a1   0.9463535901944437
b2   0.9463535901944437
c3   0.9463535901944437
d4   0.15
Personalized PageRank
Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
sourceId: a1
id   pagerank
a1   0.33343371928623045
c3   0.28341866139329586
b2   0.21580437563085933
d4   0.0
sourceId: b2
id   pagerank
b2   0.33343371928623045
a1   0.28341866139329586
c3   0.21580437563085933
d4   0.0
sourceId: c3
id   pagerank
c3   0.33343371928623045
b2   0.28341866139329586
a1   0.21580437563085933
d4   0.0
sourceId: d4
id   pagerank
d4   0.15
a1   0.0
b2   0.0
c3   0.0
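The slide poses the question without stating a decision rule. One possible reading of the numbers, sketched below in plain Scala, is to flag an address when Personalized PageRank from every other address gives it a score of zero, meaning no known address links to it. The scores are copied from the tables on this slide; the rule itself, the zero threshold and all names are assumptions made for illustration.

```scala
// Sketch: flag addresses that receive no rank from any other address.
object FraudCheckSketch {
  // Personalized PageRank scores from the slide: sourceId -> (vertex id -> score).
  val personalized: Map[String, Map[String, Double]] = Map(
    "a1" -> Map("a1" -> 0.33343371928623045, "c3" -> 0.28341866139329586,
                "b2" -> 0.21580437563085933, "d4" -> 0.0),
    "b2" -> Map("b2" -> 0.33343371928623045, "a1" -> 0.28341866139329586,
                "c3" -> 0.21580437563085933, "d4" -> 0.0),
    "c3" -> Map("c3" -> 0.33343371928623045, "b2" -> 0.28341866139329586,
                "a1" -> 0.21580437563085933, "d4" -> 0.0),
    "d4" -> Map("d4" -> 0.15, "a1" -> 0.0, "b2" -> 0.0, "c3" -> 0.0)
  )

  // Hypothetical rule: a candidate is isolated when every other source scores it 0.0.
  def isIsolated(candidate: String): Boolean =
    personalized.forall { case (source, scores) =>
      source == candidate || scores.getOrElse(candidate, 0.0) == 0.0
    }

  def main(args: Array[String]): Unit =
    for (id <- Seq("a1", "b2", "c3", "d4"))
      println(s"$id flagged: ${isIsolated(id)}")
}
```

Under this rule only d4, the fake address added earlier, is flagged. A production rule would likely use a tuned threshold and more signals than a single zero test.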
Future Directions and Thoughts
• Focus on delivering value over tools and
technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large-scale Graph Analytics is still not scalable
Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and
Analytics tweets

Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
