3rd Workshop on Graph-based Technologies and
Applications (Graph-TA)
UPC, Barcelona
Introduction
● Motivation: a key difficulty for data analysts of online social networks (OSNs) is
obtaining publicly available data while respecting the privacy of the users.
● Solution: use synthetically generated data [1].
However, this makes it challenging to generate a dataset that is realistic in terms of
topologies, attribute values, communities, data distributions, and so on.
Introduction
• In the following we present an approach for
generating a graph topology and populating it with
synthetic data to simulate an online social network.
• 3 steps: (1) Topology generation, (2) Data definition, (3) Data population.
Step 1: Topology generation
We use the R-mat (Recursive MATrix) method [3] to generate an OSN-like topology. R-mat
employs a statistical approach and a recursive process to replicate power-law
distributions, skewed distributions and community structure (which can be hierarchical),
while maintaining a small diameter for the graph.
The adjacency matrix A of a graph of N nodes is an N × N
matrix, with entry a(i, j) = 1 if the edge (i, j) exists, and 0
otherwise.
The adjacency matrix is recursively subdivided into four
equal-sized partitions, and edges are distributed within
these partitions with unequal probabilities.
Starting from an empty adjacency matrix, edges are
assigned to the matrix one at a time.
For each edge, one of the four partitions is chosen with
probabilities a, b, c, d respectively (a + b + c + d = 1).
The chosen partition is again subdivided into four smaller
partitions, and the procedure is repeated until a single
cell is obtained (1 × 1 partition).
The edge is then assigned to this cell of the adjacency
matrix.
Fig. 1. R-mat model
In the current work, we have used the
following probabilities: a=0.45, b=0.15,
c=0.15, d=0.25
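The recursive procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that N is a power of two (N = 2**scale), that a, b, c, d refer to the top-left, top-right, bottom-left and bottom-right partitions, and that self-loops and duplicate edges are discarded. The function name rmat_edges and its parameters are hypothetical.

```python
import random

def rmat_edges(scale, n_edges, a=0.45, b=0.15, c=0.15, d=0.25, seed=None):
    """Generate directed edges on N = 2**scale nodes by recursive quadrant choice."""
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("a + b + c + d must equal 1")
    n = 2 ** scale
    if n_edges > n * (n - 1):
        raise ValueError("more edges requested than non-loop cells available")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        row = col = 0
        for _ in range(scale):
            # Choose one of the four partitions with probabilities a, b, c, d.
            q = rng.choices((0, 1, 2, 3), weights=(a, b, c, d))[0]
            row = 2 * row + (q >> 1)  # partitions 2 and 3 are the lower half
            col = 2 * col + (q & 1)   # partitions 1 and 3 are the right half
        if row != col:                # assumption: self-loops are dropped
            edges.add((row, col))     # the set silently discards duplicates
    return sorted(edges)
```

For example, rmat_edges(6, 200, seed=1) returns 200 distinct edges over 64 nodes. Each pass of the inner loop halves the current partition, so after scale passes a single cell (row, col) remains.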
Step 1: Topology generation
We identify the communities in the
graph structure generated by R-mat by
processing it with the Louvain method [4],
which assigns a community
label to each vertex in the graph.
Fig. 2a. R-mat generated topology. Different colors indicate communities.
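The Louvain method works by greedily moving vertices between communities to increase the modularity of the labelling. The sketch below does not implement Louvain itself; it only computes the modularity score Q that the method optimizes, for an undirected graph given as an edge list. The function name modularity and the toy graph are illustrative assumptions.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph for a given labelling."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # sum of vertex degrees in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by one bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
```

On the toy graph, labelling each triangle as its own community gives Q = 5/14, while putting all six vertices in one community gives Q = 0, which is why a modularity-based method would prefer the two-community labelling.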
Step 1: Topology generation
Next we identify a set of seed vertices,
which will be used as the starting
points for propagating the data.
• For each community, we find
the medoid node in terms of its
statistical and topological
properties (especially centrality
and degree).
• We can progressively assign
seed vertices that give
optimal coverage of the
complete graph while not
overlapping in their immediate
neighborhoods (to avoid
overwriting data propagated
from different seeds).
Fig. 2b. R-mat generated topology with seed nodes indicated.
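One possible way to sketch the seed selection is a greedy pass over the communities. The code below is a simplification under stated assumptions: it uses the within-community degree as a stand-in for the medoid criterion (the slides mention centrality and degree but do not give the exact measure), and it skips a candidate whose closed neighborhood overlaps that of a seed already chosen. The name select_seeds is hypothetical.

```python
from collections import defaultdict

def select_seeds(edges, community):
    """Pick at most one seed per community with non-overlapping neighborhoods."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    members = defaultdict(set)
    for v, c in community.items():
        members[c].add(v)
    seeds, covered = {}, set()
    # Assumption: larger communities choose their seed first.
    for c in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        # Rank candidates by degree inside their own community (highest first).
        ranked = sorted(members[c], key=lambda v: (-len(adj[v] & members[c]), v))
        for v in ranked:
            closed = adj[v] | {v}
            if not closed & covered:   # no overlap with earlier seeds
                seeds[c] = v
                covered |= closed
                break
    return seeds
```

A community whose candidates all overlap with earlier seeds is left without a seed in this sketch, which relates to the coverage issue listed under Challenges.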
Step 2: Data definition
• The choice of data will be application specific.
• The distributions of the values of the different attributes should be similar to
those of some real social network (ground truth).
• To achieve this, we can use sources of official statistics, such as
government census data:
  ◦ www.indexmundi.com, www.census.gov, www.bls.gov
• We can also use statistical summaries made public by the social network providers, such as
Facebook:
  ◦ www.adweek.com, fanpagelist.com, http://royal.pingdom.com/2009/11/27/study-males-vs-females-in-social-networks/
• We may also need some lookup tables for inter-related attribute values.
  ◦ For example, for the age group "18-25" there will be a much higher proportion of
"profession=student" and "marital status=single", and for "gender=male" there will be a
higher proportion of "like3=soccer club".
• Another option for 'ground truth' is publicly available OSN datasets,
such as:
  ◦ SNAP: http://snap.stanford.edu/data/#communities
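A lookup table for inter-related attributes can be represented as a conditional distribution keyed by the conditioning value. The sketch below assumes profession is drawn conditionally on age group. The conditional proportions in PROFESSION_GIVEN_AGE are invented purely for illustration and do not come from any of the sources listed above.

```python
import random

# Hypothetical lookup table: P(profession | age group).
# The numbers are made up for this example only.
PROFESSION_GIVEN_AGE = {
    "18-25": {"Student": 0.60, "Service": 0.25, "Professional": 0.15},
    "36-45": {"Student": 0.05, "Service": 0.35, "Professional": 0.60},
}

def draw_profession(age_group, rng):
    """Draw a profession using the row of the lookup table for this age group."""
    table = PROFESSION_GIVEN_AGE[age_group]
    values, weights = zip(*table.items())
    return rng.choices(values, weights=weights)[0]
```

Calling draw_profession("18-25", random.Random(0)) repeatedly yields "Student" far more often than for the "36-45" group, which is the kind of dependency the lookup table is meant to capture.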
Step 2: Data definition
Attribute | Values
Age | "18-25" (32%), "26-35" (26%), "36-45" (20%), "46-55" (13%), "56-65" (9%)
Gender | "male" (47%), "female" (53%)
Residence | "Palo Alto" (17%), "Santa Barbara" (16%), "Boca Raton" (16%), "Boston" (17%), "Norfolk" (17%), "San Jose" (17%)
Religion | "Christian" (31.9%), "Hindu" (14.8%), "Jewish" (0.2%), "Muslim" (27.1%), "Sikh" (0.3%), "Traditional Spirituality" (0.1%), "Other Religions" (12.9%), "No religious affiliation" (12.7%)
Marital status | "Single" (31.5%), "Married" (51.4%), "Divorced" (10.5%), "Widowed" (6.6%)
Profession (ISCO-08 structure) | "Manager" (12.2%), "Professional" (17.1%), "Service" (13.9%), "Sales and office" (17.8%), "Student" (23%), "Natural resources construction and maintenance" (7.0%), "Production transportation and material moving" (9.0%)
Political orientation | "Far Left" (9.4%), "Left" (34.7%), "Center Left" (18.1%), "Center" (18.0%), "Center Right" (10.5%), "Right" (8.0%), "Far Right" (1.3%)
{like1, like2, like3} | Patterns: {"entertainment", "entertainment", "music artist"} (25%), {"music artist", "music artist", "entertainment"} (25%), {"drink brand", "drink brand", "entertainment"} (25%), {"tv show", "drink brand", "soccer club"} (25%)
Table 1. Example attributes, attribute-values and their proportions.
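The proportions in Table 1 can be used directly as sampling weights. The following sketch draws a profile for three of the attributes, using the percentages from the table. It draws each attribute independently, which is a simplifying assumption: it ignores the inter-related values that the lookup tables are meant to handle. The names ATTRIBUTES and draw_profile are hypothetical.

```python
import random

# Percentages copied from Table 1 (three attributes only).
ATTRIBUTES = {
    "age": {"18-25": 32, "26-35": 26, "36-45": 20, "46-55": 13, "56-65": 9},
    "gender": {"male": 47, "female": 53},
    "marital_status": {"Single": 31.5, "Married": 51.4,
                       "Divorced": 10.5, "Widowed": 6.6},
}

def draw_profile(rng):
    """Draw one value per attribute, weighted by its Table 1 proportion."""
    return {
        name: rng.choices(list(dist), weights=list(dist.values()))[0]
        for name, dist in ATTRIBUTES.items()
    }
```

Over many draws the frequency of each value approaches its proportion in the table, so a large synthetic population reproduces the required marginal distributions.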
Step 3: Data population
• In order to optimize the assignment, we can use a fitness function and find the
optimum configuration using a stochastic process.
Fitness = f(S, P, D)
where S = set of seed assignments, P = set of data propagation rules and thresholds, D = set
of required data distributions.
• We initiate the data population of the network using the set of seed nodes.
• The rest of the nodes will be assigned data by propagation from the seed nodes.
• The immediate neighbors of a seed will have a higher probability of being assigned similar
attribute values.
• For each hop further from the seed node, the influence of the seed node on other nodes is
progressively reduced. However, we give precedence to seed nodes in the same community as the
node to be assigned.
• We can also use ontologies/taxonomies and a distance measure to assign similar, rather than
identical, values (with an appropriate threshold) when propagating attribute-values.
• The influence of the seed node on the assignment has to be traded off against the desired overall
proportions of the attribute values (diversity).
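The propagation rules above can be sketched for a single attribute as follows. This is an illustration under several assumptions that are not in the slides: the influence of a seed falls off as decay**hops, a node copies the value of its preferred seed with that probability, and otherwise it draws from the required overall proportions. The names hops_from and propagate, and the decay parameter, are hypothetical.

```python
import random
from collections import defaultdict, deque

def hops_from(seed, adj):
    """Breadth-first hop counts from one seed to every reachable node."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def propagate(edges, community, seed_values, marginal, decay=0.5, rng=None):
    """Assign one attribute value to every node, starting from the seeds."""
    rng = rng or random.Random()
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {s: hops_from(s, adj) for s in seed_values}
    values = dict(seed_values)
    for v in community:
        if v in values:
            continue
        # Default: draw from the required overall proportions (diversity).
        drawn = rng.choices(list(marginal), weights=list(marginal.values()))[0]
        reachable = [s for s in seed_values if v in dist[s]]
        if reachable:
            # Prefer a seed in the same community, then the closest seed.
            s = min(reachable,
                    key=lambda s: (community[s] != community[v], dist[s][v]))
            # Seed influence shrinks with every hop (assumed decay**hops).
            if rng.random() < decay ** dist[s][v]:
                drawn = seed_values[s]
        values[v] = drawn
    return values
```

With decay=1.0 every reachable node copies its preferred seed, and with decay=0.0 every non-seed node is drawn from the overall proportions. Values in between express the trade-off between seed influence and diversity. Nodes that no seed can reach always receive a random assignment.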
Step 3: Data population
Fig. 3. Data propagation from seed
nodes and topology of Fig. 1.
Challenges
• A high-degree node chosen as a seed may
have a disproportionate influence on the
network.
• The synthetic data generation
may create fictitious user
profiles (combinations of
attribute-values).
• The restrictions on the
placement of the seed
nodes and the topology of
the communities may cause
a significant percentage of
nodes to have random
assignments (coverage).
• Making the communities
representative of key
attribute-value profiles.
• Obtaining 'ground truth'.
Acknowledgements / References
References
[1] D.F. Nettleton, J. Salas, A Data Driven Anonymization Method for Information Rich OSN Graphs. Submitted to the journal "Expert Systems with Applications", Dec. 2014.
[2] D.F. Nettleton, Data mining of social networks represented as graphs, Computer Science Review 7, 1-34 (2013).
[3] D. Chakrabarti, Y. Zhan, C. Faloutsos, R-mat: A recursive model for graph mining, in Proc. SIAM Data Mining Conference, 2004. SIAM, Philadelphia, PA.
[4] V.D. Blondel, J.L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008.
Acknowledgements
This work is partially funded by the Spanish MEC (project TIN2013-49814-EXP).
Thank you for your attention!

Generating synthetic online social network graph data and topologies
