Building a Location Based Social Graph in Spark at InMobi
Seinjuti Chatterjee
Ian Anderson
Talk Agenda
• InMobi background
• Privacy
• Location Service
• Data collection
• Location based social groups
• Static
• Dynamic
• Spark at InMobi
InMobi: Engaging 1bn users across the globe
Global Premium Publishers
Example Mobile Ads: Rich Media, Banner, Video, Interstitial, Native
InMobi SDK and Data Collection
Ad Request -> Data Enrichment / Processing -> Ad Auction -> Serve Ad
Target behaviours not users
Hadoop and Spark Setup
• Hadoop / Pig with Falcon / Oozie scheduling
• HDFS store
• Spark - Standalone Scheduler
• Daily snapshots
• Classified data per geo region
• Feature generation pipeline
• hRaven monitoring
• Decision tree generation
• Connected components generation
• Location based app similarity
Social Groups
• “A social group within social sciences has been defined
as two or more people who interact with one another,
share similar characteristics, and collectively have a
sense of unity.”
- http://en.wikipedia.org/wiki/Social_group
Examples: Uni. friends, 49ers fans, film club, work friends, car club
Location-Based Social Groups
• Focus on location as the group membership criteria
• Shared behaviours
• Location can indicate social connection e.g. 49ers
fans at Levi’s Stadium
• San Francisco example
• We want to reach a group of people that travel into
San Francisco for business purposes
• Visual demonstration
• Implementation walkthrough
Identifying POIs
Location Data
POI Classifier
“Given a geographic location, e.g. the United States, United Kingdom, India, we want to understand the underlying nature of the location, in order to enrich the context of the ad request at the time of origin”
• Types of POI can be Public Terminal, Hotel or Inn or B&B, Eatery, Sports Center, Community Center, Academic Institution, Retail Store.
• POI classification helps profile a user as a commuter, a frequent flyer, a university student, a frequent business traveller or an avid retail therapist.
• We have trained a Decision Tree model on labelled data of POI visitation frequency patterns, which currently gives us up to 70% accuracy in predicting the POI type of a location.
• The final classifier was trained on 70K labelled locations and used to label 500K locations and public Wi-Fi networks.
Feature Ranking using Information Gain
Average merit     Average rank    Attribute
1.331 +/- 0.003    1.0 +/- 0.0    normalized ssid groupsize
0.251 +/- 0.006    2.1 +/- 0.3    avg hour spread
0.241 +/- 0.001    2.9 +/- 0.3    ratio of avg weekend daytime adreq to avg weekend nighttime adreq
0.227 +/- 0.001    4.0 +/- 0.0    ratio of weekend early hours to weekend adreq
0.224 +/- 0.001    5.2 +/- 0.4    ratio of weekend late hours to weekend adreq
0.222 +/- 0.002    5.8 +/- 0.4    ratio of weekend lunch hours to weekend adreq
0.208 +/- 0.002    7.0 +/- 0.0    percentage of devices who visited at least 1 uniq days
0.202 +/- 0.001    8.0 +/- 0.0    ratio of avg weekday daytime adreq to avg weekday nighttime adreq
0.194 +/- 0.001    9.0 +/- 0.0    ratio of avg weekday to weekend adreq
0.184 +/- 0.001   10.0 +/- 0.0    ratio of weekday lunch hours to weekday adreq
0.178 +/- 0.001   11.0 +/- 0.0    ratio_of_weekend_breakfast_hours_to_weekend_adreq
0.172 +/- 0.001   12.0 +/- 0.0    percentage_of_devices_who_visited_atleast_2_uniq_days
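The slides rank features by information gain but do not show how the score is computed. The following is a minimal Python sketch of the standard definition (entropy of the labels minus the weighted entropy after grouping by a feature value). It is not the authors' code: the toy feature bins, the labels and the function names are assumptions made for illustration, and a real pipeline would first discretize continuous features such as "avg hour spread".

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy of the groups induced by the feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one informative binned feature and one uninformative one.
poi_label = ["STORE", "STORE", "EATERY", "EATERY", "EATERY", "STORE"]
hour_spread_bin = ["low", "low", "high", "high", "high", "low"]
noise_feature = ["x", "y", "x", "y", "x", "y"]

gain_informative = information_gain(hour_spread_bin, poi_label)
gain_noise = information_gain(noise_feature, poi_label)
print(gain_informative, gain_noise)
```

A feature that separates the classes perfectly earns the full label entropy as its gain, while a feature unrelated to the label scores close to zero, which is the ordering the ranking table expresses.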
Spark Implementation
• Number of data points = 69,465; number of classes = 7
• Test/train split = 30% / 70%
• impurity="gini", maxDepth=30, maxBins=32, numTrees=13, featureSubsetStrategy="auto"
• DecisionTree classifier with holdout set
• RandomForest classifier with holdout set
• More tuning and 10-fold cross-validation are required
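To make the "gini" impurity setting concrete, here is a small Python sketch of how Gini impurity scores a node and a candidate split. It illustrates the general criterion only, under the assumption of a toy two-class node; it is not Spark's implementation, and the function names and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a binary split: size-weighted average of the child impurities."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


# Hypothetical node with four EATERY and four STORE locations.
parent = ["EATERY"] * 4 + ["STORE"] * 4
left = ["EATERY", "EATERY", "EATERY", "STORE"]
right = ["EATERY", "STORE", "STORE", "STORE"]

print(gini(parent))             # impurity before the split
print(split_gini(left, right))  # impurity after the split
```

A tree learner prefers the candidate split with the largest drop in impurity. Settings such as maxDepth and maxBins limit how deep the tree grows and how finely continuous features are bucketed when candidate splits are generated.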
Decision Tree Model Results: Spark
DecisionTreeModel classifier of depth 30 with 22351 nodes
Detailed Accuracy by Class
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.54     0.05     0.54       0.54    0.54       UNIVERSITY_OR_COLLEGE
0.71     0.11     0.72       0.72    0.71       INN_AND_HOTELS
0.70     0.15     0.71       0.71    0.70       EATERY
0.29     0.08     0.28       0.28    0.29       COMMUNITY_CENTER
0.60     0.06     0.60       0.60    0.60       STORE
0.24     0.05     0.23       0.24    0.24       SPORTS_AND_FITNESS
0.19     0.00     0.20       0.19    0.19       AIRPORT_BUS_RAIL_TERMINAL
Decision Tree Model Results: Spark
Confusion Matrix
    a     b     c     d     e     f     g   <- classified as
 1100   189   174   375    90    88     4   a = UNIVERSITY_OR_COLLEGE
  201  4154   648   304   269   217    22   b = INN_AND_HOTELS
  208   633  5047   463   451   364    23   c = EATERY
  335   284   389   581   158   226     9   d = COMMUNITY_CENTER
  103   246   433   157  1561    94    17   e = STORE
   95   228   342   213    74   301     6   f = SPORTS_AND_FITNESS
    6    23    27    17     8     4    20   g = AIRPORT_BUS_RAIL_TERMINAL
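As a worked check, per-class precision, recall and F-measure can be recomputed from the Spark decision tree confusion matrix. This is a minimal Python sketch rather than the authors' evaluation code; the function names are illustrative assumptions. Rows are read as actual classes and columns as predicted classes, following the "classified as" header.

```python
CLASSES = [
    "UNIVERSITY_OR_COLLEGE", "INN_AND_HOTELS", "EATERY", "COMMUNITY_CENTER",
    "STORE", "SPORTS_AND_FITNESS", "AIRPORT_BUS_RAIL_TERMINAL",
]

# Spark decision tree confusion matrix from the slide (row = actual, column = predicted).
MATRIX = [
    [1100, 189, 174, 375, 90, 88, 4],
    [201, 4154, 648, 304, 269, 217, 22],
    [208, 633, 5047, 463, 451, 364, 23],
    [335, 284, 389, 581, 158, 226, 9],
    [103, 246, 433, 157, 1561, 94, 17],
    [95, 228, 342, 213, 74, 301, 6],
    [6, 23, 27, 17, 8, 4, 20],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f_measure) tuple for each class index."""
    metrics = []
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        actual = sum(matrix[i])                    # row total
        predicted = sum(row[i] for row in matrix)  # column total
        recall = true_positives / actual if actual else 0.0
        precision = true_positives / predicted if predicted else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f_measure))
    return metrics


def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


metrics = per_class_metrics(MATRIX)
for name, (p, r, f) in zip(CLASSES, metrics):
    print(f"{name:27s} precision={p:.2f} recall={r:.2f} f={f:.2f}")
print(f"accuracy={accuracy(MATRIX):.3f}")
```

For UNIVERSITY_OR_COLLEGE the recall is 1100 / 2020 and the precision is 1100 / 2048, both about 0.54.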
Random Forest Results: Spark
TreeEnsembleModel classifier with 13 trees, Test Error = 0.308
Detailed Accuracy by Class
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.64     0.03     0.73       0.68    0.64       UNIVERSITY_OR_COLLEGE
0.82     0.09     0.77       0.80    0.82       INN_AND_HOTELS
0.85     0.21     0.68       0.75    0.85       EATERY
0.26     0.04     0.42       0.32    0.26       COMMUNITY_CENTER
0.60     0.03     0.77       0.68    0.60       STORE
0.18     0.03     0.31       0.23    0.18       SPORTS_AND_FITNESS
0.30     0.00     0.83       0.44    0.30       AIRPORT_BUS_RAIL_TERMINAL
Random Forest Results: Spark
Confusion Matrix
    a     b     c     d     e     f     g   <- classified as
 1303   167   251   202    53    52     0   a = UNIVERSITY_OR_COLLEGE
   47  4747   721    69   113    97     4   b = INN_AND_HOTELS
   51   537  6060   178   141   168     0   c = EATERY
  327   259   632   506   109   130     2   d = COMMUNITY_CENTER
   37   170   695    94  1587    48     0   e = STORE
   23   231   568   151    38   228     0   f = SPORTS_AND_FITNESS
    1    21    29     8    11     1    30   g = AIRPORT_BUS_RAIL_TERMINAL
Weka Implementation
Trained a Decision Tree Model in WEKA
• Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
• Relation: us.publicPOI.classification
• Filter: weka.filters.supervised.instance.SMOTE
• Instances: 69465
• Attributes: 28
• Number of Leaves : 6377
• Size of the tree : 12753
Decision Tree Model Results: Weka
Detailed Accuracy by Class
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.671    0.043    0.628      0.671   0.649      0.815     UNIVERSITY_OR_COLLEGE
0.805    0.089    0.776      0.805   0.790      0.873     INN_AND_HOTELS
0.802    0.113    0.788      0.802   0.795      0.859     EATERY
0.333    0.060    0.368      0.333   0.350      0.672     COMMUNITY_CENTER
0.786    0.025    0.820      0.786   0.803      0.887     STORE
0.234    0.040    0.269      0.234   0.250      0.617     SPORTS_AND_FITNESS
0.405    0.001    0.599      0.405   0.484      0.757     AIRPORT_BUS_RAIL_TERMINAL
Decision Tree Model Results: Weka
Confusion Matrix
Model built with 10-fold cross-validation
    a      b      c     d     e     f    g   <- classified as
 4500    562    478   768   158   226   10   a = UNIVERSITY_OR_COLLEGE
  517  15458   1569   765   292   580   31   b = INN_AND_HOTELS
  486   1673  19145  1144   497   902   11   c = EATERY
 1229    957   1219  2194   315   643   26   d = COMMUNITY_CENTER
  169    443    653   363  6849   230    7   e = STORE
  257    737   1195   703   219   953    4   f = SPORTS_AND_FITNESS
   12     78     47    28    18    12  133   g = AIRPORT_BUS_RAIL_TERMINAL
Model Results Comparison
Metric                        Decision Tree (Spark)  Random Forest (Spark)  Baseline Decision Tree (Weka)
Weighted F-measure            0.618                  0.676                  0.705
Weighted Precision            0.620                  0.676                  0.702
Weighted Recall               0.616                  0.693                  0.709
Weighted TPR                  0.616                  0.693                  0.709
Weighted FPR                  0.098                  0.109                  0.079
Test Error Rate               0.384                  0.307                  0.290
Hotel and Inn F-measure       0.71                   0.79                   0.79
University/College F-measure  0.54                   0.68                   0.65
Eatery F-measure              0.70                   0.76                   0.79
Store F-measure               0.60                   0.76                   0.80

Property                      Decision Tree (Spark)  Random Forest (Spark)  Baseline Decision Tree (Weka)
Input training sample size    70K                    70K                    70K
Time taken to build model     2 mins                 2 mins                 35.69 seconds
Resource usage                Single Node Cluster    Single Node Cluster    Single Node Cluster
Scalability                   YES                    YES                    NO
Parallelism                   YES                    YES                    NO
Accuracy                      60%                    69%                    70%
Classifier Visualization: Inn/Hotel
Visualization: Airport
Visualization: University / College
Visualization: Eatery
Visualization: Store
Connected Component
• A connected component is a set of locations that have been frequently co-visited by users over a month.
• Conceptually it is a subgraph of frequent visitation trends, which translates into a profile of the user.
• 5,265 connected components were generated for 576,093 locations in 2 hours, where each location has seen 4 devices on average.
Examples:
• University students who like eating @BuffaloWing
• Frequent business travellers to SFO who stay at hotels and rent a car
Spark Implementation
1. Create vertices
2. Create edges
3. Run connected component algorithm
4. Rank users
[Figure: locations L(1), L(2), L(3) drawn as vertices. Example location attributes: SSID: shopwifi_67; GPS: 37.21, -122.43; Label: STORE. Example visitation attributes: Days revisited: 7; Avg. hour spread: 1; Req. frequency: 0.1. Edge labels: "1 user covisited", "2 users covisited".]
• Connection strength given by number of edges (i.e. common users) between L(i) and L(j)
• Cluster locations sorted by component size
• Rank the users per connected component
• Profile the user with the profile of the connected component
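The four steps above can be sketched on a single machine in plain Python. This is an illustrative approximation, not the Spark job described in the talk: the visit records, the union-find implementation and the ranking rule (number of component locations a user visited) are all assumptions, and at the scale quoted in the slides the same logic would run on a distributed graph library.

```python
from collections import defaultdict
from itertools import combinations


def covisit_edges(visits):
    """Map each location pair to the number of distinct users who visited both."""
    locations_by_user = defaultdict(set)
    for user, location in visits:
        locations_by_user[user].add(location)
    weights = defaultdict(int)
    for locations in locations_by_user.values():
        for a, b in combinations(sorted(locations), 2):
            weights[(a, b)] += 1
    return dict(weights)


def connected_components(vertices, edges):
    """Union-find over the edges; returns components sorted by size, largest first."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)
    return sorted(groups.values(), key=len, reverse=True)


def rank_users(component, visits):
    """Hypothetical ranking: users ordered by how many component locations they visited."""
    counts = defaultdict(int)
    for user, location in set(visits):
        if location in component:
            counts[user] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical (user, location) visit records.
visits = [
    ("u1", "L1"), ("u1", "L2"),
    ("u2", "L1"), ("u2", "L2"), ("u2", "L3"),
    ("u3", "L4"), ("u3", "L5"),
    ("u4", "L6"),
]

vertices = sorted({location for _, location in visits})   # 1. create vertices
weights = covisit_edges(visits)                           # 2. create edges
min_common_users = 1                                      # assumed threshold
edges = [pair for pair, w in weights.items() if w >= min_common_users]
components = connected_components(vertices, edges)        # 3. connected components
ranking = rank_users(components[0], visits)               # 4. rank users
print(components, ranking)
```

Raising min_common_users keeps only strongly co-visited location pairs, which splits loose clusters into smaller, more specific components.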
Connected Component: University, Eateries, Target Store
Connected Component: Students, Coffee Shops, Library
Connected Component: Frequent Business Travellers to SFO
Connected Component: Frequent Business Travellers to LAX
Using classifications: App Usage
University Students  General Population  Category
0.01%                6.65%               Productivity
0.14%                5.77%               Health & Fitness
0.01%                4.44%               Lifestyle
3.14%                4.26%               Entertainment
0.00%                3.33%               News
0.03%                2.91%               Reference
9.55%                2.42%               Tools
4.39%                0.28%               Social
5.26%                0.21%               Media & Video
Wider Spark Usage at InMobi
• Migration towards Spark as the runtime of choice for data processing
• Legacy Pig jobs being switched to the Spark backend using Pig on Spark
• Pure MR applications being rewritten to use the Spark Java API
Special Thanks to Paul Duff, Senior Research Scientist
Thank you & Questions
