Studying user footprints in different online social networks

Anshu Malhotra (1), Luam Totti (2), Wagner Meira Jr. (2),
Ponnurangam Kumaraguru (1), Virgílio Almeida (2)

(1) Indraprastha Institute of Information Technology, New Delhi, India
(2) Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

August, 2012


Online Digital Footprints

Users commonly register and access accounts (sometimes
several) on multiple and diverse online services like
Facebook, LinkedIn, Twitter and YouTube
The set of all information related to the user, either provided
directly or observed from the user’s interaction, is often called
the user’s online digital footprint [6]


Linking User’s Online Accounts

To create a user's digital footprint, the user's multiple
accounts must be known.
We call this process linking a user's online accounts.


Linking User’s Online Accounts
Linking user accounts from diﬀerent services can serve several
purposes [1, 4, 10, 11, 12, 13]:
Centralize user information, enforcing data consistency and
simplifying account maintenance
Enrich recommendation systems
Cross-system personalization
Enable cross-system characterization and pattern analysis
Assess and possibly prevent unwanted information leakage,
thereby protecting users from various privacy and security
threats


Main Challenges
Users may choose different usernames on different services,
and these may be unrelated to each other or to their real names
[5]
People with common names tend to have similar usernames
[8, 17]
Users may enter inconsistent and misleading information
across their profiles [5], unintentionally or often deliberately in
order to preserve privacy
Heterogeneity in the network structure and profile fields
among the services


Existing Techniques
Various techniques have been proposed for unifying /
disambiguating users’ various profiles across different online
services:
Techniques based on FOAF ontology & graphs [4, 9, 10, 11]
Techniques based on user generated tags [5, 12, 13]
Techniques based on usernames [8, 17]
Techniques based on user profile attributes [1, 2, 3, 6, 7, 14]


Limitations of Existing Techniques

Specificity to certain types of social networks
Dependency on identifiers like email IDs, Instant Messenger
IDs which might not be publicly available
Use of simple text matching algorithms for comparing
complex profile fields
Manual and experimental assignment of weights and
thresholds which can be subjective and not scalable


Limitations of Existing Techniques

Use of small datasets (biggest being 5,000 users), wherein
the data collection and evaluation was done manually in
some approaches
Real world evaluation has not been done for most of the
techniques
Computationally expensive


Major Contributions of our Work

A scalable supervised learning approach for linking users'
accounts from different services
Evaluation of different context-specific similarity metrics
for comparing different profile fields
Results on a large dataset of linked user accounts across
Twitter and LinkedIn
Evaluation of the system's performance for discovering new
accounts for a given user


User Profile Disambiguation - System Architecture
Account Correlation Extractor: collates the dataset of user
profiles known to belong to the same user across
different social networks
Profile Crawler: crawls the public profile information from
user accounts on these services
A user's Online Digital Footprint is generated after Feature
Extraction and Selection
Various Classifiers are trained on account pairs belonging to
the same user and pairs belonging to different users, and
are then used to disambiguate user profiles, i.e. to classify
a given input profile pair as belonging to the same user or
not




Dataset Collection - 1st Stage
The training and testing dataset was collated from two types
of sources:
Social Aggregators: Services that allow users to specify their
multiple accounts in order to create a unified feed. We
crawled 883,668 users from FriendFeed (http://friendfeed.com/)
and 38,755 users from Profilactic (http://www.profilactic.com/).
Social Graph (http://code.google.com/apis/socialgraph/docs/):
API that constructs and provides social interaction data,
including information about users' accounts on multiple services.
Of the 14 million users collected, only 3.9 million had useful
information.

Dataset Collection - Example

twitter,justinbieber youtube,kidrauhl youtube,justinbieber
twitter,aplusk facebook,ashton
youtube,felipeneto youtube,felipenetovlog
youtube,maspoxavida twitter,pecesiqueira
youtube,jp youtube,MysteryGuitarMan
youtube,descealetra twitter,cauemoura
twitter,jcillpam11six facebook,jeremiahcillpam11six
twitter,cjdances facebook,cjperrydances
twitter,sirhilton facebook,richardsrueda


Dataset Collection - 2nd Stage

Publicly available information on each account was then
collected from each service
Four services were initially chosen for the analysis
Twitter, LinkedIn, YouTube and Flickr

Six profile fields were chosen for the analysis
UserID, display name, description, location, connections, image


Dataset Collection
Due to the high percentage of missing fields, YouTube and Flickr
were excluded
All further analysis in this work refers only to Twitter and
LinkedIn accounts
[Figure: bar chart of the percentage of missing values (Missing (%)) for the
Name, Location, Description and Image fields on Twitter, LinkedIn, YouTube
and Flickr]


Profile Similarity

A profile may be seen as an N-dimensional vector, where each
component is a profile field [14]
Therefore, comparing profiles can be done component-wise
However, components are of very distinct nature and hence
demand different similarity methods for comparison
In this work we evaluated different approaches for comparing
each profile field
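
The component-wise comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names, the `exact_match` and `token_overlap` stand-in metrics, and the `similarity_vector` helper are all assumptions introduced here.

```python
from typing import Callable, Dict, List

Profile = Dict[str, str]
Metric = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Stand-in metric (assumption): 1.0 if the values match ignoring case."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def token_overlap(a: str, b: str) -> float:
    """Stand-in metric (assumption): shared tokens divided by all tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def similarity_vector(p: Profile, q: Profile, metrics: Dict[str, Metric]) -> List[float]:
    """Compare two profiles field by field, using one metric per field."""
    return [metric(p.get(field, ""), q.get(field, "")) for field, metric in metrics.items()]


# Hypothetical profiles; each field gets its own comparison function.
metrics = {"userid": exact_match, "location": token_overlap}
twitter = {"userid": "jdoe", "location": "New Delhi India"}
linkedin = {"userid": "JDoe", "location": "Delhi India"}
vector = similarity_vector(twitter, linkedin, metrics)
```

Keeping the metrics in a mapping from field name to function makes it easy to swap in a different comparison for any single field.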


Similarity Metrics
UserID & Display Name:
Jaro-Winkler [15] distance (JW) is best suited for similarity
between short strings, hence it was used for both these fields
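
A minimal sketch of the Jaro-Winkler score for two short strings is given below. It follows the standard textbook definition (matching window, transposition count, common-prefix bonus); the prefix weight `p=0.1` and the 4-character prefix cap are the conventional defaults and are assumptions here, since the slides do not state the parameters used. The usernames come from the dataset example slide.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


same_user = jaro_winkler("cjdances", "cjperrydances")
other_user = jaro_winkler("cjdances", "richardsrueda")
```

With these inputs the linked pair scores higher than the unrelated pair, which is the behaviour a username feature needs.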

Description (desc): The fields had punctuation and stop
words removed. The words were then lemmatized and
converted to lower case to produce the final token set.
TF-IDF: Cosine similarity between the two token sets using
their tf-idf vector space representation
Jaccard (Jacc): Jaccard’s similarity score between the two
token sets
Ontology (Ont): Wu-Palmer [16] similarity distance between
the WordNet-based ontologies of each description field
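
The Jaccard and TF-IDF comparisons of description token sets can be sketched as below. This is an illustrative sketch under several assumptions: the stop-word list is a tiny placeholder, lemmatization is omitted, the IDF uses a smoothed formula chosen here for simplicity, and the example descriptions are invented. The Wu-Palmer ontology metric is not covered.

```python
import math
import string
from collections import Counter

# Tiny placeholder stop-word list (assumption); a real list would be longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "at", "for", "to", "is", "i"}


def tokenize(text):
    """Lower-case, strip punctuation and stop words (lemmatization omitted)."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]


def jaccard(tokens_a, tokens_b):
    """Size of the intersection over size of the union of the token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def tfidf_cosine(tokens_a, tokens_b, corpus):
    """Cosine similarity of TF-IDF vectors; corpus is a list of token lists."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    def weights(tokens):
        tf = Counter(tokens)
        # Smoothed IDF (assumption): log((1 + N) / (1 + df)) + 1
        return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, c in tf.items()}

    wa, wb = weights(tokens_a), weights(tokens_b)
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm = (math.sqrt(sum(w * w for w in wa.values()))
            * math.sqrt(sum(w * w for w in wb.values())))
    return dot / norm if norm else 0.0


# Invented descriptions for illustration only.
desc_a = tokenize("Researcher in privacy and social networks.")
desc_b = tokenize("Privacy researcher; social computing")
corpus = [desc_a, desc_b, tokenize("Photographer and traveller")]
jacc = jaccard(desc_a, desc_b)
cosine = tfidf_cosine(desc_a, desc_b, corpus)
```

Unlike Jaccard, the TF-IDF score gives less weight to words that appear in many descriptions, so sharing a rare word counts for more than sharing a common one.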


Similarity Metrics
Location (Loc): Tokens were extracted from the location
fields of both profiles by removing punctuation and
converting them to lower case.
Sub-string (Substr): Normalized count of the tokens
from one field value that are present as a substring of the other
field value
Geo-distance (Geo): Euclidean distance between the two
locations using their latitudes and longitudes, obtained from the
Google Maps Geocoding API (https://developers.google.com/maps)
Jaro-Winkler distance
Jaccard's score
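
The sub-string and geo-distance metrics can be sketched as follows. This is one possible reading of the slide, not the authors' code: the score is normalized by the number of tokens in the first field (an assumption), and the coordinates are assumed to have been geocoded already, so no geocoding API call is shown.

```python
import math
import string


def location_tokens(value):
    """Lower-case the location and split it into punctuation-free tokens."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return value.lower().translate(table).split()


def substring_score(loc_a, loc_b):
    """Fraction of loc_a's tokens that occur as a substring of loc_b."""
    tokens = location_tokens(loc_a)
    if not tokens:
        return 0.0
    target = " ".join(location_tokens(loc_b))
    hits = sum(1 for token in tokens if token in target)
    return hits / len(tokens)


def geo_distance(coord_a, coord_b):
    """Euclidean distance between two (latitude, longitude) pairs, in degrees."""
    return math.hypot(coord_a[0] - coord_b[0], coord_a[1] - coord_b[1])


score_ab = substring_score("New Delhi, India", "Delhi")
score_ba = substring_score("Delhi", "New Delhi, India")
```

The score is asymmetric under this normalization, so a practical version would need to pick a direction or combine both.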


Similarity Metrics
Profile Image (img ):
The profile image was downloaded and stored locally
Each image was then scaled down to 48 × 48 pixels using
cubic spline interpolation
Each image was then converted to grayscale.
Each image could then be represented as a vector of values
from 0 to 255, to which functions for computing Mean Square
Error (mse), Peak Signal-to-Noise Ratio (psnr), and
Levenshtein distance (ls) were applied to quantify profile image
similarity
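
Once both images are flattened into equal-length vectors of grayscale values, MSE and PSNR can be computed as in the sketch below. It assumes the images have already been resized and converted; image loading and the Levenshtein metric are left out. The PSNR formula is the standard one for 8-bit images.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two equal-length grayscale vectors."""
    if len(img_a) != len(img_b) or not img_a:
        raise ValueError("images must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(img_a, img_b)) / len(img_a)


def psnr(img_a, img_b, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(img_a, img_b)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


# Two hypothetical 2-pixel "images" for illustration.
error = mse([0, 10], [0, 20])
ratio = psnr([0, 10], [0, 20])
```

Lower MSE and higher PSNR both indicate more similar images, so the two features carry closely related information.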


Similarity Metrics
Number Of Connections (conn): For Twitter, the number of
connections of a user u is the number of users that u follows.
For LinkedIn, it is the number of users in the private network of
user u
The number of connections in different services can assume
different ranges, with different meanings
Normalized (norm): Each connection value c was normalized
to the range [0..1] using the smallest and greatest connection
values observed in each service. The similarity was then taken
as the unsigned difference between the two values.
Class: Each value was assigned an (equally sized) class denoting
how big it was. Five classes were used in this work (0-4). The
similarity was taken as the difference between the two classes.
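
Both connection metrics can be sketched as below. The slide does not say how the five classes were delimited; this sketch assumes equal-width bins over each service's observed range, and the ranges in the example are invented.

```python
def normalize(value, lo, hi):
    """Min-max scale a connection count to [0, 1] for one service."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def norm_difference(count_a, range_a, count_b, range_b):
    """Unsigned difference of the two normalized connection counts."""
    return abs(normalize(count_a, *range_a) - normalize(count_b, *range_b))


def size_class(value, lo, hi, n_classes=5):
    """Map a count to a class 0..n_classes-1 using equal-width bins (assumption)."""
    index = int(normalize(value, lo, hi) * n_classes)
    return min(index, n_classes - 1)


def class_difference(count_a, range_a, count_b, range_b):
    """Unsigned difference between the two size classes."""
    return abs(size_class(count_a, *range_a) - size_class(count_b, *range_b))


# Hypothetical per-service (min, max) ranges.
twitter_range = (0, 100)
linkedin_range = (0, 500)
diff_norm = norm_difference(50, twitter_range, 250, linkedin_range)
diff_class = class_difference(95, twitter_range, 20, linkedin_range)
```

Normalizing per service is what makes counts comparable when one network's typical range is much larger than the other's.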


Evaluation Experiments
Feature Analysis: To analyze the discriminative capacity of
different profile attributes and similarity metrics for
disambiguating user profiles
Matching Profiles: To test the effectiveness of supervised
learning approaches for classifying two profiles as belonging
to the same user
Discovering Candidate Profiles: To evaluate the
performance of our framework for discovering new accounts
for a given known user
The analysis was done using a dataset of account pairs of
29,129 unique users


Feature Analysis
Feature      Metric   IG      Relief  MDL     Gini
UserID       JW       0.548   0.434   0.379   0.151
Name         JW       0.812   0.521   0.562   0.217
Description  Jacc     0.286   0.134   0.274   0.084
Description  TF-IDF   0.323   0.180   0.300   0.092
Description  Ont      0.161   0.113   0.188   0.051
Connections  Norm     0.000   0.002   -0.006  0.000
Connections  Class    0.009   0.095   0.006   0.003
Location     JW       0.232   0.108   0.158   0.067
Location     Jacc     0.337   0.041   0.233   0.098
Location     Substr   0.350   0.039   0.270   0.102
Location     Geo      0.520   0.227   0.488   0.146
Image        MSE      0.183   0.157   0.205   0.051
Image        PSNR     0.184   0.158   0.205   0.051
Image        LS       0.215   0.188   0.227   0.061

Table: Discriminative capacity of each <feature, metric> pair according to
four different approaches (IG, Relief, MDL, Gini).


Feature Analysis - Box Plots
[Figure: box plots comparing Match and Non Match pairs; panels: userid jw,
name jw, location jw, location jaccard, location substring, location geo]

Figure: Box plots for the UserID, Name and Location features.


Feature Analysis - Box Plots
[Figure: box plots comparing Match and Non Match pairs; panels: description
jaccard, description tf-idf, description ontology, connections class,
image psnr, image levenshtein]

Figure: Box plots for the Description, Connections and Image features.


Matching User Profiles
The most promising similarity metrics and features were used
to train classifiers for the task of detecting profiles that belong
to the same user
Similarity Vector: E.g.: <userid_jw, desc_jaccard, loc_geo>
Training Set:
Positive Examples: Similarity vectors for the account pairs of
the dataset
Negative Examples: Equal number of negative examples
synthesized by randomly pairing accounts from different users
and calculating their similarity vectors
A total of 58,258 training instances
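
The construction of a balanced training set can be sketched as below; the total of 58,258 instances is consistent with 29,129 positive pairs plus the same number of synthesized negatives. The function name, the account identifiers and the fixed random seed are assumptions for illustration, and the sketch produces labelled account pairs rather than similarity vectors.

```python
import random


def build_training_pairs(linked_accounts, seed=0):
    """Return (twitter_id, linkedin_id, label) triples, half of each label.

    linked_accounts: list of (twitter_id, linkedin_id) pairs known to belong
    to the same user. Negatives pair accounts taken from two different users.
    """
    n = len(linked_accounts)
    if n < 2:
        raise ValueError("need at least two users to synthesize negatives")
    rng = random.Random(seed)
    positives = [(t, l, 1) for t, l in linked_accounts]
    negatives = []
    while len(negatives) < n:
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j:  # accounts must come from different users
            negatives.append((linked_accounts[i][0], linked_accounts[j][1], 0))
    return positives + negatives


# Hypothetical linked accounts.
linked = [("tw_ana", "li_ana"), ("tw_bob", "li_bob"),
          ("tw_cal", "li_cal"), ("tw_dee", "li_dee")]
training = build_training_pairs(linked)
```

Each triple would then be turned into a similarity vector before being passed to a classifier.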



After training, the classifiers were tested with Twitter-LinkedIn
profile pairs to be classified as a "Match" or a "Non Match"
A "Match" means that the two given input profiles belong to
the same user, while "Non Match" means they don't
Classifiers used:
Naïve Bayes
Decision Tree
SVM
kNN


Results were generated for all possible combinations of
profile features and similarity metrics using 10-fold cross
validation.
As shown in the table below, we achieved accuracy, precision and
recall of 98%, 99% and 96% respectively for the best feature set

Classifier     Accuracy  Precision  Recall  F1
Naïve Bayes    0.980     0.996      0.964   0.980
Decision Tree  0.965     0.994      0.936   0.964
SVM            0.972     0.988      0.956   0.971
kNN            0.898     0.998      0.798   0.887

Table: Results for multiple classifiers using the feature set
{name_jw, userid_jw, loc_geo, desc_tfidf, img_ls, conn_norm}.
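
The F1 values in the table above follow from the reported precision and recall via the usual harmonic mean. The short check below recomputes them; because the inputs are already rounded to three decimals, the recomputed values can differ from the reported ones by about 0.001.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "Naive Bayes": (0.996, 0.964, 0.980),
    "Decision Tree": (0.994, 0.936, 0.964),
    "SVM": (0.988, 0.956, 0.971),
    "kNN": (0.998, 0.798, 0.887),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
max_gap = max(abs(recomputed[name] - f1) for name, (_, _, f1) in reported.items())
```

The kNN row shows why F1 is worth reporting alongside precision: its precision is the highest in the table, but its lower recall pulls F1 well below the other classifiers.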


Discovering New User Profiles
The results are very good on a static dataset, but what
if we don't have candidates to match against a known user
profile?
A system was developed for retrieving candidate profiles of
possible matches for a known account from some other
service.
One fifth of the true positive data was reserved as
the testing set
A Naïve Bayes classifier was trained with the remaining set
and was modified to return the probability that the two input
profiles belong to the same user


Discovering Candidate User Profiles
We query Twitter’s API using LinkedIn’s display name for
each profile pair from the testing dataset
For each of the profiles returned from the Twitter API, we
compute the similarity vector with the LinkedIn profile of the
user
We then use the trained classifier to return the probability
that each of these profiles belongs to the same user
We rank the Twitter profiles in decreasing order of their
probabilities
Ideally the correct Twitter profile (of the profile pair from the
testing set) should be at the top of this ranking
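
The ranking step can be sketched as follows. The candidate usernames and probabilities are invented, and a plain dictionary lookup stands in for the trained classifier; the Twitter API query and similarity vector computation are not shown.

```python
from typing import Callable, List, Optional, Tuple


def rank_candidates(candidates: List[str],
                    match_probability: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Score every candidate and sort by decreasing match probability."""
    scored = [(candidate, match_probability(candidate)) for candidate in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def rank_of(correct: str, ranked: List[Tuple[str, float]]) -> Optional[int]:
    """1-based position of the correct profile, or None if it was not returned."""
    for position, (candidate, _) in enumerate(ranked, start=1):
        if candidate == correct:
            return position
    return None


# Hypothetical probabilities standing in for the classifier's output.
probabilities = {"john_doe_84": 0.40, "jdoe": 0.91, "doe_j": 0.12}
ranked = rank_candidates(list(probabilities), probabilities.get)
position = rank_of("jdoe", ranked)
```

Returning `None` for a missing profile keeps failed searches distinguishable from low-ranked ones when the results are aggregated.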


[Figure: line plot of Profiles Found (%) against rank position (up to 20),
with one curve for All features and one for Best Features]

Figure: Relation between the rank position r and the percentage
of times the right profile is found at a position lower than or equal to r.
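
A curve like the one in the figure above can be computed from the rank at which the correct profile appeared for each test query. The sketch below uses invented ranks, with `None` marking queries where the correct profile was not among the returned candidates.

```python
def found_within(ranks, r):
    """Percentage of queries whose correct profile is ranked at position <= r."""
    if not ranks:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for rank in ranks if rank is not None and rank <= r)
    return 100.0 * hits / len(ranks)


# Hypothetical ranks of the correct profile for five test queries.
ranks = [1, 1, 3, None, 7]
curve = [(r, found_within(ranks, r)) for r in (1, 3, 10)]
```

Queries where the correct profile never appears count as misses at every r, so the curve can plateau below 100%.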


In 64% of the cases the right profile was found in the first
position of the ranking when using all features, while this value
was 49% for the set of the best features
This means that using all features instead of only the best can
help to disambiguate between the possible candidates.
In 75% of the cases the right profile was among the first 3
positions of the ranking
This suggests the system can be used in a semi-supervised
manner


Conclusions & Results

Applied automated techniques to identify accounts belonging
to the same user in different online services
Only publicly available information was extracted and used
Proposed and evaluated multiple similarity metrics, comparing
their discriminative capacity for the task of profile linking
UserID and Name, when compared using the Jaro-Winkler
metric, were the most discriminative features


Conclusions & Results
For the best set of features and similarity metrics we achieved
accuracy, precision and recall of 98%, 99% and 96%
respectively
Evaluated the system's performance for discovering a
user's profile on Twitter given their display name on LinkedIn
Using all features instead of only the most discriminative ones
gave better results
The system may be used to match user profiles automatically
with 64% accuracy (right profile ranked first) or in a
semi-supervised manner to narrow down candidate profiles


Future Work

Incorporate more profile fields
Generalize our model to include other social networks
Adapt our system to handle missing and incorrect profile
attributes


For any further information, please write to
pk@iiitd.ac.in
precog.iiitd.edu.in


Bibliography I
[1] Carmagnola, F., and Cena, F.
User identification for cross-system personalisation.
Inf. Sci. 179, 1-2 (Jan. 2009), 16–32.
[2] Carmagnola, F., Osborne, F., and Torre, I.
Cross-systems identification of users in the social web.
In 8th IADIS Int. Conf. WWW/INTERNET, Rome, Italy (2009), pp. 129–134.
[3] Carmagnola, F., Osborne, F., and Torre, I.
User data distributed on the social web: how to identify users on different social
systems and collecting data about them.
In Proceedings of the 1st International Workshop on Information Heterogeneity
and Fusion in Recommender Systems (New York, NY, USA, 2010), HetRec '10,
ACM, pp. 9–15.
[4] Golbeck, J., and Rothstein, M.
Linking social networks on the web with FOAF: a semantic web case study.
AAAI'08, pp. 1138–1143.
[5] Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K.
Identifying users across social tagging systems, 2011.


Bibliography II
[6] Irani, D., Webb, S., Li, K., and Pu, C.
Large online social footprints–an emerging threat.
In CSE '09 (Aug. 2009), vol. 3, pp. 271–276.
[7] Kontaxis, G., Polakis, I., Ioannidis, S., and Markatos, E.
Detecting social network profile cloning.
In PerCom (March 2011), pp. 295–300.
[8] Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P.
How unique and traceable are usernames?
In PETS (2011), pp. 1–17.
[9] Rowe, M.
Applying semantic social graphs to disambiguate identity references.
In 6th Annual European Semantic Web Conference (ESWC2009) (June 2009),
pp. 461–475.
[10] Rowe, M.
Interlinking distributed social graphs.
In LDOW2009 (April 2009).


Bibliography III
[11] Rowe, M., and Ciravegna, F.
Harnessing the social web: The science of identity disambiguation.
In Web Science Conference (2010).
[12] Szomszor, M., Alani, H., Cantador, I., O'Hara, K., and Shadbolt, N.
Semantic modelling of user interests based on cross-folksonomy analysis.
ISWC '08, pp. 632–648.
[13] Szomszor, M. N., Cantador, I., and Alani, H.
Correlating user profiles from multiple folksonomies.
HT '08, pp. 33–42.
[14] Vosecky, J., Hong, D., and Shen, V. Y.
User identification across multiple social networks.
In NDT'09 (July 2009).
[15] Winkler, W. E.
String comparator metrics and enhanced decision rules in the Fellegi-Sunter
model of record linkage.
In Proceedings of the Section on Survey Research (1990), pp. 354–359.


Bibliography IV

[16] Wu, Z., and Palmer, M.
Verbs semantics and lexical selection.
In Proceedings of the 32nd annual meeting on Association for Computational
Linguistics (Stroudsburg, PA, USA, 1994), ACL '94, Association for
Computational Linguistics, pp. 133–138.
[17] Zafarani, R., and Liu, H.
Connecting corresponding identities across communities, 2009.
