Generating synthetic online social network graph data and topologies

3rd Workshop on Graph-based Technologies and
Applications (Graph-TA)
UPC, Barcelona

Introduction
● Motivation: a major difficulty for data analysts of
online social networks (OSNs) is the limited public
availability of data, since any released data must
respect the privacy of the users.
● Solution: use synthetically generated data [1].
However, this raises a series of challenges in
generating a realistic dataset in terms of topology,
attribute values, communities and data distributions.

Introduction
● In the following we present an approach for
generating a graph topology and populating it with
synthetic data to simulate an online social network.
● 3 steps: topology generation, data definition and
data population.

Step 1: Topology generation
We use the R-mat (Recursive MATrix) method [3] to generate an OSN-like topology. R-mat
employs a statistical approach and a recursive process to replicate the power-law
distributions, skewed distributions and community structure (which can be hierarchical),
while maintaining a small diameter for the graph.
The adjacency matrix A of a graph of N nodes is an N × N
matrix, with entry a(i, j) = 1 if the edge (i, j) exists, and 0
otherwise.
The adjacency matrix is recursively subdivided into four
equal-sized partitions, and edges are distributed within
these partitions with unequal probabilities.
Starting from an empty adjacency matrix, edges are
assigned to the matrix one at a time.
For each edge, one of the four partitions is chosen with
probabilities a, b, c, d respectively (a + b + c + d = 1).
The chosen partition is again subdivided into four smaller
partitions, and the procedure is repeated until a single
cell is obtained (1 × 1 partition).
The edge is then assigned to this cell of the adjacency
matrix.
Fig. 1. R-mat model
In the current work, we have used the following probabilities:
a = 0.45, b = 0.15, c = 0.15, d = 0.25.
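
The recursive edge placement can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work. It assumes that N is a power of two (N = 2^scale), that a, b, c, d correspond to the top-left, top-right, bottom-left and bottom-right partitions, and that the graph is undirected with self-loops and duplicate edges discarded. The function names and the `scale` and `seed` parameters are hypothetical.

```python
import random


def rmat_edge(scale, a, b, c, rng):
    """Pick one cell (row, col) of a 2**scale x 2**scale adjacency matrix."""
    row = col = 0
    for level in range(scale):
        half = 1 << (scale - level - 1)  # side of each partition at this level
        r = rng.random()
        if r < a:            # top-left: row and col stay in the first half
            pass
        elif r < a + b:      # top-right
            col += half
        elif r < a + b + c:  # bottom-left
            row += half
        else:                # bottom-right (probability d = 1 - a - b - c)
            row += half
            col += half
    return row, col


def rmat_graph(scale, n_edges, a=0.45, b=0.15, c=0.15, seed=None):
    """Return a set of undirected edges (i, j), i < j, over 2**scale vertices."""
    n = 1 << scale
    if n_edges > n * (n - 1) // 2:
        raise ValueError("more edges requested than the graph can hold")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        i, j = rmat_edge(scale, a, b, c, rng)
        if i != j:  # assumption: drop self-loops; the set drops duplicates
            edges.add((min(i, j), max(i, j)))
    return edges


# Example: 64 vertices, 200 edges, probabilities as quoted above.
edges = rmat_graph(scale=6, n_edges=200, seed=1)
```

Because a is the largest probability, edges concentrate on low-numbered vertices, which is what produces the skewed degree distribution.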

We identify the communities in the graph structure
generated by R-mat by processing it with the Louvain
method [4], which assigns a community label to each
vertex in the graph.
Fig. 2a. R-mat generated topology. Different colors
indicate communities.
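
The Louvain method searches for a labelling that maximizes modularity. The sketch below does not implement Louvain itself; it only computes the modularity Q of a given labelling, which is a convenient check on the labels a community detection tool returns. It assumes an undirected, unweighted graph with each edge listed once. The function name and argument layout are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a labelling of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # sum of vertex degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge share - expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)
```

A labelling that puts every vertex in one community gives Q = 0, so values well above zero indicate that the labels capture real structure.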

Next we identify a set of seed vertices which will be
used as the starting points for propagating the data.
• For each community, we find the medoid node in
terms of its statistical and topological qualities
(especially centrality and degree).
• We can progressively assign seed vertices which
give an optimal coverage of the complete graph,
while not overlapping in their immediate
neighborhoods (to avoid overwriting data propagated
from different seeds).
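
One simple way to realize the two rules above is a greedy pass over the communities. The sketch below is an assumption-laden simplification: it uses degree alone as a stand-in for the medoid criterion, processes larger communities first, and treats two seeds as overlapping when their closed neighborhoods (the seed plus its direct neighbors) share any vertex. The function name `select_seeds` is hypothetical.

```python
from collections import defaultdict


def select_seeds(edges, community):
    """Greedy seed choice: at most one seed per community, with no vertex
    shared between the closed neighborhoods of any two seeds."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    members = defaultdict(list)
    for node, label in community.items():
        members[label].append(node)

    seeds = {}
    covered = set()  # seeds chosen so far plus their immediate neighbors
    # Assumption: handle larger communities first, ties broken by label.
    for label in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        # Assumption: highest degree approximates the community medoid.
        candidates = sorted(members[label],
                            key=lambda n: (-len(neighbors[n]), n))
        for node in candidates:
            closed = neighbors[node] | {node}
            if closed.isdisjoint(covered):
                seeds[label] = node
                covered |= closed
                break
    return seeds
```

A community can end up without a seed when all of its candidates touch the neighborhood of an earlier seed, which is one source of the coverage problem discussed under Challenges.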
Fig. 2b. R-mat generated topology with seed nodes indicated.

Step 2: Data definition
• The choice of data will be application specific.
• The distributions of the values of the different attributes should be similar to
those of a real social network (ground truth).
• To achieve this, we can use sources of official statistics, such as
government census data:
  - www.indexmundi.com, www.census.gov, www.bls.gov
• We can also use statistical summaries made public by the social network providers, such as
Facebook:
  - www.adweek.com, fanpagelist.com, http://royal.pingdom.com/2009/11/27/study-males-vs-females-in-social-networks/
• We may also need some lookup tables for inter-related attribute values.
  - For example, for the age group "18-25" there will be a much higher proportion of
"profession=student" and "marital status=single", and for "gender=male" there will be a
higher proportion of "like3=soccer club".
• Another option for ‘ground truth’ is to use publicly available OSN datasets,
such as:
  - SNAP: http://snap.stanford.edu/data/#communities
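
A lookup table for inter-related attributes can be represented as a mapping from a conditioning value to a distribution over the dependent attribute. The sketch below is only an illustration: the table name, the function name and above all the probabilities are hypothetical and were invented for this example, not taken from any of the sources listed above.

```python
import random

# HYPOTHETICAL lookup table: P(profession | age group). The numbers are
# invented for illustration and do not come from census or OSN statistics.
PROFESSION_GIVEN_AGE = {
    "18-25": {"Student": 0.60, "Service": 0.20, "Sales and office": 0.20},
    "26-35": {"Student": 0.10, "Professional": 0.50, "Sales and office": 0.40},
}


def sample_conditional(lookup, condition, fallback, rng):
    """Draw a value from the distribution stored for `condition`.

    If the lookup table has no row for `condition`, draw from `fallback`,
    which would normally be the overall distribution of the attribute.
    """
    dist = lookup.get(condition, fallback)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


rng = random.Random(0)
profession = sample_conditional(
    PROFESSION_GIVEN_AGE, "18-25", {"Manager": 1.0}, rng)
```

Sampling the conditioning attribute first (here, age) and the dependent attribute second keeps combinations such as a young student more frequent than they would be under independent sampling.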

Step 2: Data definition
Attribute | Values (proportion)
Age | "18-25" (32%), "26-35" (26%), "36-45" (20%), "46-55" (13%), "56-65" (9%)
Gender | "male" (47%), "female" (53%)
Residence | "Palo Alto" (17%), "Santa Barbara" (16%), "Boca Raton" (16%), "Boston" (17%), "Norfolk" (17%), "San Jose" (17%)
Religion | "Christian" (31.9%), "Hindu" (14.8%), "Jewish" (0.2%), "Muslim" (27.1%), "Sikh" (0.3%), "Traditional Spirituality" (0.1%), "Other Religions" (12.9%), "No religious affiliation" (12.7%)
Marital status | "Single" (31.5%), "Married" (51.4%), "Divorced" (10.5%), "Widowed" (6.6%)
Profession (ISCO-08 structure) | "Manager" (12.2%), "Professional" (17.1%), "Service" (13.9%), "Sales and office" (17.8%), "Student" (23%), "Natural resources construction and maintenance" (7.0%), "Production transportation and material moving" (9.0%)
Political orientation | "Far Left" (9.4%), "Left" (34.7%), "Center Left" (18.1%), "Center" (18.0%), "Center Right" (10.5%), "Right" (8.0%), "Far Right" (1.3%)
{like1, like2, like3} | Patterns: {"entertainment", "entertainment", "music artist"} (25%), {"music artist", "music artist", "entertainment"} (25%), {"drink brand", "drink brand", "entertainment"} (25%), {"tv show", "drink brand", "soccer club"} (25%)
Table 1. Example attributes, attribute-values and their proportions.
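
The proportions in Table 1 can be stored directly as weights and sampled from. The sketch below encodes three of the attributes, checks that each row sums to 100%, and draws one profile. The names `TABLE_1`, `check_proportions` and `sample_profile` are hypothetical, and sampling each attribute independently is a simplifying assumption that ignores inter-related attributes.

```python
import random

# Three rows of Table 1, proportions in percent.
TABLE_1 = {
    "age": {"18-25": 32, "26-35": 26, "36-45": 20, "46-55": 13, "56-65": 9},
    "gender": {"male": 47, "female": 53},
    "marital_status": {"Single": 31.5, "Married": 51.4,
                       "Divorced": 10.5, "Widowed": 6.6},
}


def check_proportions(table, tol=1e-6):
    """Raise ValueError if the proportions of any attribute do not sum to 100."""
    for attribute, dist in table.items():
        total = sum(dist.values())
        if abs(total - 100.0) > tol:
            raise ValueError(f"{attribute} sums to {total}, expected 100")


def sample_profile(table, rng):
    """Draw one value per attribute, each independently of the others."""
    return {
        attribute: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for attribute, dist in table.items()
    }


check_proportions(TABLE_1)
profile = sample_profile(TABLE_1, random.Random(42))
```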

Step 3: Data population
• To optimize the assignment, we can use a fitness function and find the
optimum configuration using a stochastic process.
Fitness = f(S, R, D)
where S = set of seed assignments, R = set of data propagation rules and thresholds, D = set
of required data distributions.
• We initiate the data population of the network using the set of seed nodes.
• The rest of the nodes will be assigned data by propagation from the seed nodes.
• The immediate neighbors of a seed will have a higher probability of being assigned similar
attribute values.
• For each hop further from the seed node, the influence of the seed node on other nodes is
progressively reduced. However, we give precedence to seed nodes in the same community as the
node to be assigned.
• We can also use ontologies/taxonomies and a distance measure to assign similar, rather than
identical, values (with an appropriate threshold) when propagating attribute-values.
• The influence of the seed node on the assignment has to be traded off against the desired overall
proportions of the attribute values (diversity).
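
The propagation rules can be sketched as follows for a single attribute. This is one possible reading of the rules above, not the procedure used in this work. It assumes that a node copies the value of its governing seed with probability decay^hops, that a seed in the same community takes precedence over a nearer seed in another community, and that a node which does not copy draws from the required overall distribution. The names `hop_distances`, `propagate` and `decay` are hypothetical, and similarity-based assignment via ontologies is left out.

```python
import random
from collections import deque


def hop_distances(neighbors, source):
    """Breadth-first hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def propagate(neighbors, community, seed_values, target, decay=0.5, rng=None):
    """Assign one attribute value to every node.

    neighbors: dict node -> iterable of adjacent nodes (every node is a key).
    seed_values: dict seed node -> value already assigned to that seed.
    target: dict value -> weight, the required overall distribution.
    """
    rng = rng or random.Random()
    dists = {s: hop_distances(neighbors, s) for s in seed_values}
    values, weights = list(target), list(target.values())
    assigned = dict(seed_values)
    for node in neighbors:
        if node in assigned:
            continue
        reachable = [s for s in seed_values if node in dists[s]]
        # Same-community seeds first, then the seed with the fewest hops.
        reachable.sort(key=lambda s: (community[s] != community[node],
                                      dists[s][node]))
        if reachable:
            seed = reachable[0]
            # Seed influence shrinks geometrically with each extra hop.
            if rng.random() < decay ** dists[seed][node]:
                assigned[node] = seed_values[seed]
                continue
        # Otherwise fall back to the required overall proportions.
        assigned[node] = rng.choices(values, weights=weights, k=1)[0]
    return assigned
```

The `decay` parameter controls the trade-off in the last point above: values near 1 favor homogeneity around each seed, while values near 0 favor the required overall proportions.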

Step 3: Data population
Fig. 3. Data propagation from seed
nodes and topology of Fig. 1.

Challenges
• A high-degree node chosen as a seed may have a
disproportionate influence on the network.
• The synthetic data generation may create fictitious user
profiles (combinations of attribute-values).
• The restrictions on the placement of the seed nodes and
the topology of the communities may cause a significant
percentage of nodes to have random assignments (coverage).
• Making the communities representative of key
attribute-value profiles.
• Obtaining ‘ground truth’.
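
The coverage issue can be quantified by measuring which fraction of the nodes lies within a given number of hops of at least one seed. The sketch below is a minimal illustration. The function name `seed_coverage` and the `max_hops` threshold are hypothetical, and it assumes that nodes beyond the threshold are the ones that would receive random assignments.

```python
from collections import deque


def seed_coverage(neighbors, seeds, max_hops):
    """Fraction of nodes within `max_hops` hops of at least one seed.

    neighbors: dict node -> iterable of adjacent nodes (every node is a key).
    """
    if not neighbors:
        return 0.0
    reached = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:  # breadth-first search started from all seeds at once
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append((nxt, hops + 1))
    return len(reached) / len(neighbors)
```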

Acknowledgements / References
References
[1] D.F. Nettleton, J. Salas, A Data Driven Anonymization Method for Information Rich OSN
Graphs. Submitted to the Journal "Expert Systems with Applications", Dec. 2014.
[2] D.F. Nettleton, Data mining of social networks represented as graphs, Computer Science
Review 7, 1-34 (2013).
[3] D. Chakrabarti, Y. Zhan, C. Faloutsos, R-mat: A recursive model for graph mining, in Proc.
SIAM Data Mining Conference, 2004. SIAM, Philadelphia, PA.
[4] V.D. Blondel, J.L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in
large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008.
Acknowledgements
This work is partially funded by the Spanish MEC (project TIN2013-49814-EXP).

Thank you for your attention!
