Studying User Footprints in Different Online Social Networks

Anshu Malhotra¹, Luam Totti², Wagner Meira Jr.², Ponnurangam Kumaraguru¹, Virgílio Almeida²

¹ Indraprastha Institute of Information Technology, New Delhi, India
² Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

August, 2012

Online Digital Footprints

Users commonly register and access accounts (sometimes
several) on multiple and diverse online services such as
Facebook, LinkedIn, Twitter and YouTube
The set of all information related to a user, either provided
directly or observed from the user’s interaction, is often called
the user’s online digital footprint [6]

Linking User’s Online Accounts

To create a user’s digital footprint, the user’s multiple
accounts must be known.
We call this process linking a user’s online accounts.

Linking User’s Online Accounts
Linking user accounts from different services can serve several
purposes [1, 4, 10, 11, 12, 13]:
Centralize user information, enforcing data consistency and
simplifying account maintenance
Enrich recommendation systems
Cross-system personalization
Enable cross-system characterization and pattern analysis
Assess and possibly prevent unwanted information leakage,
thereby protecting users from various privacy and security
threats

Main Challenges
Users may choose different, unrelated usernames on
different services, which may also be unrelated to their real names
[5]
People with common names tend to have similar usernames
[8, 17]
Users may enter inconsistent and misleading information
across their profiles [5], either unintentionally or, often,
deliberately in order to preserve privacy
Heterogeneity in the network structure and profile fields
among the services

Existing Techniques
Various techniques have been proposed for unifying /
disambiguating users’ various profiles across different online
services:
Techniques based on FOAF ontology & graphs [4, 9, 10, 11]
Techniques based on user generated tags [5, 12, 13]
Techniques based on usernames [8, 17]
Techniques based on user profile attributes [1, 2, 3, 6, 7, 14]

Limitations of Existing Techniques

Specificity to certain types of social networks
Dependency on identifiers such as email or instant messenger
IDs, which might not be publicly available
Use of simple text matching algorithms for comparing
complex profile fields
Manual and experimental assignment of weights and
thresholds, which can be subjective and does not scale

Limitations of Existing Techniques

Use of small datasets (the largest having 5,000 users); in
some approaches the data collection and evaluation were
done manually
Most of the techniques have not been evaluated in a
real-world setting
High computational cost

Major Contributions of our Work

A scalable supervised learning approach for linking users’
accounts from different services
Evaluation of different context-specific similarity metrics
for comparing different profile fields
Results on a large dataset of linked user accounts across
Twitter and LinkedIn
Evaluation of the system’s performance for discovering new
accounts for a given user

User Profile Disambiguation - System Architecture
Account Correlation Extractor: collates the dataset of user
profiles known to belong to the same user across
different social networks
Profile Crawler: crawls the public profile information from
the user accounts on these services
A user’s Online Digital Footprints are generated after Feature
Extraction and Selection
Various Classifiers are trained on account pairs belonging to
the same user and on pairs belonging to different users; they
are then used to disambiguate user profiles, i.e. to classify a
given input profile pair as belonging to the same user or
not

User Profile Disambiguation - System Architecture

Dataset Collection - 1st Stage
The training and testing dataset was collated from two types
of sources:
Social Aggregators: Services that allow users to specify their
multiple accounts in order to create a unified feed. We
crawled 883,668 users from FriendFeed¹ and 38,755 users from
Profilactic².
Social Graph³: API that constructs and provides social
interaction data, including information about users’ accounts
on multiple services. Of the 14 million users collected, only 3.9
million had useful information.

¹ http://friendfeed.com/
² http://www.profilactic.com/
³ http://code.google.com/apis/socialgraph/docs/

Dataset Collection - Example

twitter,justinbieber youtube,kidrauhl youtube,justinbieber
twitter,aplusk facebook,ashton
youtube,felipeneto youtube,felipenetovlog
youtube,maspoxavida twitter,pecesiqueira
youtube,jp youtube,MysteryGuitarMan
youtube,descealetra twitter,cauemoura
twitter,jcillpam11six facebook,jeremiahcillpam11six
twitter,cjdances facebook,cjperrydances
twitter,sirhilton facebook,richardsrueda
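
The example lines above appear to list the accounts of one person per line, as space-separated service,username tokens. A minimal Python sketch of turning such lines into cross-service account pairs is given here; the function names are illustrative assumptions and are not taken from the original system.

```python
from itertools import combinations


def parse_account_line(line):
    """Split a line such as 'twitter,aplusk facebook,ashton' into (service, username) tuples."""
    accounts = []
    for token in line.split():
        service, username = token.split(",", 1)
        accounts.append((service, username))
    return accounts


def cross_service_pairs(accounts):
    """Return every pair of accounts that live on two different services."""
    return [(a, b) for a, b in combinations(accounts, 2) if a[0] != b[0]]


line = "twitter,justinbieber youtube,kidrauhl youtube,justinbieber"
accounts = parse_account_line(line)
pairs = cross_service_pairs(accounts)
```

Pairs on the same service (here the two YouTube accounts) are skipped, since the linking task concerns accounts on different services.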

Dataset Collection - 2nd Stage

Publicly available information on each account was then
collected from each service
Four services were initially chosen for the analysis
Twitter, LinkedIn, YouTube and Flickr

Six profile fields were chosen for the analysis
UserID, display name, description, location, connections, image

Dataset Collection
Due to the high percentage of missing fields, YouTube and Flickr
were excluded
All further analysis in this work refers only to Twitter and
LinkedIn accounts
[Figure: Missing (%) per profile field (Name, Location, Description, Image) for Twitter, LinkedIn, YouTube and Flickr]

Profile Similarity

A profile may be seen as an N-dimensional vector, where each
component is a profile field [14]
Profiles can therefore be compared component-wise
However, the components differ greatly in nature and hence
demand different similarity methods for comparison
In this work we evaluated different approaches for comparing
each profile field
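
The component-wise comparison described above can be sketched as a mapping from field names to comparison functions. The two comparators below are deliberately simplistic placeholders, and all names are hypothetical; the metrics actually evaluated in this work are listed on the following slides.

```python
def similarity_vector(profile_a, profile_b, comparators):
    """Compare two profiles field by field.

    comparators maps a field name to a function (value_a, value_b) -> float.
    A field missing from either profile yields None instead of a guessed score.
    """
    vector = {}
    for field, compare in comparators.items():
        a, b = profile_a.get(field), profile_b.get(field)
        vector[field] = compare(a, b) if a is not None and b is not None else None
    return vector


def exact_match(a, b):
    """Placeholder string comparator: 1.0 if equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def abs_difference(a, b):
    """Placeholder numeric comparator: absolute difference."""
    return float(abs(a - b))


comparators = {"name": exact_match, "connections": abs_difference}
twitter = {"name": "Jane Doe", "connections": 120}
linkedin = {"name": "jane doe", "connections": 100, "location": "Delhi"}
vec = similarity_vector(twitter, linkedin, comparators)
```

Keeping one comparator per field makes it easy to swap in a different metric for a single field without touching the others.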

Similarity Metrics
UserID & Display Name:
Jaro-Winkler [15] distance (JW) is best suited for measuring
similarity between short strings, hence it was used for both these fields

Description (desc): Punctuation and stop words were removed
from the fields. The words were then lemmatized and
converted to lower case to produce the final token set.
TF-IDF: Cosine similarity between the two token sets using
their tf-idf vector space representation
Jaccard (Jacc): Jaccard similarity score between the two
token sets
Ontology (Ont): Wu-Palmer [16] similarity between
the WordNet-based ontologies of each description field
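
As an illustration of two of the metrics named above, here is a self-contained Python sketch of the textbook Jaro-Winkler similarity and of the Jaccard score over token sets. It is not the authors' implementation: the prefix scale of 0.1 and the maximum prefix length of 4 are the commonly used defaults, assumed here, and the function returns a similarity in [0, 1] (higher means more alike).

```python
def jaro(s1, s2):
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Count characters that match within the sliding window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def jaccard(tokens_a, tokens_b):
    """Jaccard score: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


userid_score = jaro_winkler("cjdances", "cjperrydances")
desc_score = jaccard(["phd", "student", "security"],
                     ["security", "researcher", "phd"])
```

The prefix boost is what makes Jaro-Winkler attractive for usernames, which often share their first few characters across services.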

Similarity Metrics
Location (Loc): Tokens were extracted from the location
fields of both profiles by removing punctuation and
converting them to lower case.
Sub-string (Substr): Normalized count of the tokens
from one field value that are present as a substring of the other field
value
Geo-distance (Geo): Euclidean distance between the two
locations using their latitudes and longitudes (Google Maps
Geocoding API⁴).
Jaro-Winkler distance
Jaccard score

⁴ https://developers.google.com/maps
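
The sub-string and geo-distance metrics above might look roughly as follows in Python. Two assumptions are made: the slides do not state how the sub-string score is normalized, so it is divided here by the number of tokens in the first field; and the coordinates are assumed to have been obtained already (for example from a geocoding service), so no API call is shown.

```python
import math
import string


def location_tokens(value):
    """Lower-case a location string, strip punctuation and split into tokens."""
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def substring_score(loc_a, loc_b):
    """Fraction of the tokens of loc_a that occur as a substring of loc_b.

    Assumption: normalizing by the token count of loc_a is one plausible
    reading of "normalized"; the original choice is not documented.
    """
    tokens = location_tokens(loc_a)
    if not tokens:
        return 0.0
    haystack = " ".join(location_tokens(loc_b))
    return sum(1 for token in tokens if token in haystack) / len(tokens)


def geo_distance(coord_a, coord_b):
    """Euclidean distance between two (latitude, longitude) pairs, in degrees."""
    return math.hypot(coord_a[0] - coord_b[0], coord_a[1] - coord_b[1])


score_ab = substring_score("New Delhi, India", "Delhi")
score_ba = substring_score("Delhi", "New Delhi, India")
```

Because the score depends on which field supplies the tokens, it is not symmetric; a caller could take the maximum of the two directions if a symmetric value is needed.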

Similarity Metrics
Profile Image (img):
The profile image was downloaded and stored locally
Each image was then scaled down to 48 x 48 pixels using
cubic spline interpolation
Each image was then converted to grayscale
Each image could then be represented as a vector of values
from 0 to 255, to which functions for computing Mean Square
Error (mse), Peak Signal-to-Noise Ratio (psnr), and
Levenshtein distance (ls) were applied to quantify profile image
similarity
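
Once two images are reduced to equal-length vectors of grayscale values, the three measures named above can be computed with standard formulas. The sketch below uses plain Python lists and assumes the images have already been resized and converted; how the Levenshtein value was normalized in the original work is not stated, so the raw edit distance is returned.

```python
import math


def mse(pixels_a, pixels_b):
    """Mean square error between two equal-length pixel vectors."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b)) / len(pixels_a)


def psnr(pixels_a, pixels_b, max_value=255):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    error = mse(pixels_a, pixels_b)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


def levenshtein(seq_a, seq_b):
    """Edit distance between two sequences (strings or lists of pixel values)."""
    previous = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        current = [i]
        for j, b in enumerate(seq_b, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

A lower MSE and a higher PSNR both indicate more similar images, so the two carry largely the same information.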

Similarity Metrics
Number Of Connections (conn): For Twitter, the number of
connections of a user u is the number of users that u follows.
For LinkedIn, it is the number of users in the private network of
user u
The number of connections in different services can assume
different ranges, with different meanings
Normalized (norm): Each connection value c was normalized
to the range [0, 1] using the smallest and greatest connection
values observed in each service. The similarity was then taken
as the absolute difference between the two values.
Class: Each value was assigned an (equally sized) class denoting
how big it was. Five classes were used in this work (0-4). The
similarity was taken as the difference between the two classes.
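
A short Python sketch of the two connection metrics follows. The slide does not say whether "equally sized" classes means equal-width or equal-frequency bins; equal-width bins over the observed range are assumed here, and all function names are hypothetical.

```python
def normalize(value, smallest, greatest):
    """Min-max normalization of a connection count to the range [0, 1]."""
    if greatest == smallest:
        return 0.0
    return (value - smallest) / (greatest - smallest)


def norm_similarity(conn_a, range_a, conn_b, range_b):
    """Absolute difference of the two per-service normalized values."""
    return abs(normalize(conn_a, *range_a) - normalize(conn_b, *range_b))


def size_class(value, smallest, greatest, n_classes=5):
    """Assign a class 0..n_classes-1 using equal-width bins (an assumption)."""
    position = normalize(value, smallest, greatest)
    return min(int(position * n_classes), n_classes - 1)


def class_similarity(conn_a, range_a, conn_b, range_b, n_classes=5):
    """Absolute difference between the two class labels."""
    return abs(size_class(conn_a, *range_a, n_classes)
               - size_class(conn_b, *range_b, n_classes))


# Hypothetical per-service (smallest, greatest) ranges.
twitter_range = (0, 1000)
linkedin_range = (0, 100)
same_scale = norm_similarity(500, twitter_range, 50, linkedin_range)
```

Normalizing per service is what allows a mid-range Twitter account to be compared with a mid-range LinkedIn account despite the different absolute counts.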

Evaluation Experiments
Feature Analysis: To analyze the discriminative capacity of
different profile attributes and similarity metrics for
disambiguating user profiles
Matching Profiles: To test the effectiveness of supervised
learning approaches for classifying two profiles as belonging
to the same user
Discovering Candidate Profiles: To evaluate the
performance of our framework for discovering new accounts
for a given known user
The analysis was done using a dataset of account pairs of
29,129 unique users

Feature Analysis
        UserID   Name     Description                Connections
        JW       JW       Jacc     TF-IDF   Ont      Norm     Class
IG      0.548    0.812    0.286    0.323    0.161    0.000    0.009
Relief  0.434    0.521    0.134    0.180    0.113    0.002    0.095
MDL     0.379    0.562    0.274    0.300    0.188    -0.006   0.006
Gini    0.151    0.217    0.084    0.092    0.051    0.000    0.003

        Location                            Image
        JW       Jacc     Substr   Geo      MSE      PSNR     LS
IG      0.232    0.337    0.350    0.520    0.183    0.184    0.215
Relief  0.108    0.041    0.039    0.227    0.157    0.158    0.188
MDL     0.158    0.233    0.270    0.488    0.205    0.205    0.227
Gini    0.067    0.098    0.102    0.146    0.051    0.051    0.061

Table: Discriminative capacity of each <feature, metric> pair according to
four different approaches.
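
To make the idea of discriminative capacity concrete, the sketch below computes an entropy-based information gain for one similarity score split at a threshold. It assumes that IG in the table denotes information gain; how the continuous scores were discretized in the original analysis is not stated, so the single threshold is an illustrative choice.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(scores, labels, threshold):
    """Entropy reduction obtained by splitting the scores at a threshold."""
    below = [label for score, label in zip(scores, labels) if score < threshold]
    above = [label for score, label in zip(scores, labels) if score >= threshold]
    gain = entropy(labels)
    for part in (below, above):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


labels = ["match", "match", "non", "non"]
separating = information_gain([0.9, 0.8, 0.2, 0.1], labels, threshold=0.5)
uninformative = information_gain([0.9, 0.1, 0.8, 0.2], labels, threshold=0.5)
```

A score that cleanly separates matches from non-matches yields the full entropy of the labels as gain, while a score unrelated to the labels yields a gain near zero.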

Feature Analysis - Box Plots
Figure: Box plots for the UserID, Name and Location features. Each panel
(userid jw, name jw, location jw, location jaccard, location substring,
location geo) compares Match and Non Match pairs.

Feature Analysis - Box Plots
Figure: Box plots for the Description, Connections and Image features. Each
panel (description jaccard, description tf-idf, description ontology,
connections class, image psnr, image levenshtein) compares Match and
Non Match pairs.

Matching User Profiles
The most promising similarity metrics and features were used
to train classifiers for the task of detecting profiles that belong
to the same user
Similarity Vector: e.g. <userid_jw, desc_jaccard, loc_geo>
Training Set:
Positive Examples: Similarity vectors for the account pairs of
the dataset
Negative Examples: An equal number of negative examples
synthesized by randomly pairing accounts from different users
and calculating their similarity vectors
A total of 58,258 training instances
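
The negative examples described above can be synthesized by pairing each account with an account of a different user. The sketch below is one possible way to do this, with hypothetical names; it produces exactly one negative pair per positive pair, which is consistent with 29,129 positive pairs yielding 2 x 29,129 = 58,258 training instances.

```python
import random


def synthesize_negatives(positive_pairs, seed=0):
    """Pair each Twitter account with the LinkedIn account of another user.

    positive_pairs is a list of (twitter_account, linkedin_account) tuples,
    one per user. Returns a list of the same length.
    """
    rng = random.Random(seed)
    n = len(positive_pairs)
    if n < 2:
        raise ValueError("need at least two users to build negative pairs")
    negatives = []
    for i, (twitter_account, _) in enumerate(positive_pairs):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # skip index i so both accounts never share an owner
        negatives.append((twitter_account, positive_pairs[j][1]))
    return negatives


positives = [("t1", "l1"), ("t2", "l2"), ("t3", "l3")]
negatives = synthesize_negatives(positives)
```

Fixing the seed keeps the synthesized set reproducible between runs.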

Matching User Profiles

After training, the classifiers were tested with Twitter-LinkedIn
profile pairs to be classified as a “Match” or a “Non Match”
A “Match” means that the two given input profiles belong to
the same user, while “Non Match” means they don’t
Classifiers used:
Naïve Bayes
Decision Tree
SVM
kNN

Matching User Profiles
Results were generated for all possible combinations of
profile features and similarity metrics using 10-fold cross
validation.
As shown below, we achieved accuracy, precision and recall of
98%, 99% and 96%, respectively, for the best feature set

               Accuracy   Precision  Recall     F1
Naïve Bayes    0.980      0.996      0.964      0.980
Decision Tree  0.965      0.994      0.936      0.964
SVM            0.972      0.988      0.956      0.971
kNN            0.898      0.998      0.798      0.887

Table: Results for multiple classifiers using the feature set
{name_jw, userid_jw, loc_geo, desc_tfidf, img_ls, conn_norm}.
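
The F1 column in the table above is the harmonic mean of precision and recall. As a small worked check, the sketch below recomputes it from the reported precision and recall values; the results agree with the reported F1 values to within about 0.001, the difference being attributable to the inputs already being rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Naive Bayes": (0.996, 0.964, 0.980),
    "Decision Tree": (0.994, 0.936, 0.964),
    "SVM": (0.988, 0.956, 0.971),
    "kNN": (0.998, 0.798, 0.887),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
```

The kNN row shows why F1 is reported alongside precision: its precision is the highest of the four, but the low recall pulls the F1 score well below the others.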

Discovering New User Profiles
The results are very good on a static dataset, but what
if we don’t have candidates to match to a known user
profile?
A system was developed for retrieving candidate profiles that are
possible matches for a known account from some other
service.
One fifth of the true positive data was reserved as
the testing set
A Naïve Bayes classifier was trained with the remaining set
and was modified to return the probability that the two input
profiles belong to the same user

Discovering Candidate User Profiles
We query Twitter’s API using the LinkedIn display name for
each profile pair from the testing dataset
For each of the profiles returned by the Twitter API, we
compute the similarity vector with the LinkedIn profile of the
user
We then use the trained classifier to return the probability that
each of these profiles belongs to the same user
We rank the Twitter profiles in decreasing order of their
probabilities
Ideally the correct Twitter profile (of the profile pair from the
testing set) should be at the top of this ranking
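
The ranking step above can be sketched in a few lines of Python. The probability function stands in for the trained classifier and is an assumption, as are the candidate handles; no real API call is made.

```python
def rank_candidates(candidates, match_probability):
    """Sort candidate profiles by decreasing match probability."""
    scored = [(match_probability(candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]


def rank_of(true_profile, ranking):
    """1-based position of the true profile, or None if it was not retrieved."""
    try:
        return ranking.index(true_profile) + 1
    except ValueError:
        return None


# Hypothetical candidates returned by a name search, with toy probabilities
# standing in for the classifier output.
candidates = ["@jdoe", "@jane_doe", "@janed"]
toy_probabilities = {"@jdoe": 0.20, "@jane_doe": 0.91, "@janed": 0.55}
ranking = rank_candidates(candidates, toy_probabilities.get)
position = rank_of("@janed", ranking)
```

Returning None for a profile that the search never surfaced keeps retrieval failures distinguishable from poor rankings.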

Discovering Candidate User Profiles
Figure: Relation between the rank position r (Rank position, 5 to 20) and the
percentage of times the right profile is found at a position lower than or
equal to r (Profiles Found (%), 40 to 80), for All features and Best Features.

Discovering Candidate User Profiles
In 64% of the cases the right profile was found in the first
position of the ranking when using all features, while this value
was 49% for the set of the best features
This means that using all features instead of only the best can
help to disambiguate between the possible candidates.
In 75% of the cases the right profile was in the first 3
positions of the ranking
This suggests the system can be used in a semi-supervised
manner
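
Percentages such as those quoted above are cumulative hit rates over the test users: the share of users whose true profile appears at rank r or better. A minimal sketch, using made-up ranks rather than the study's data:

```python
def found_within(ranks, r):
    """Percentage of test users whose true profile sits at rank <= r.

    ranks holds one 1-based rank per test user, or None when the true
    profile was not among the retrieved candidates.
    """
    hits = sum(1 for rank in ranks if rank is not None and rank <= r)
    return 100.0 * hits / len(ranks)


# Illustrative ranks only; these are not results from the original work.
example_ranks = [1, 1, 3, None, 2, 7, 1, 1]
top1 = found_within(example_ranks, 1)
top3 = found_within(example_ranks, 3)
```

In a semi-supervised setting, a person would inspect only the first few ranked candidates, so the hit rate at small r is the figure of interest.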

Conclusions & Results

Applied automated techniques to identify accounts belonging
to the same user in different online services
Only publicly available information was extracted and used
Proposed and evaluated multiple similarity metrics, comparing
their discriminative capacity for the task of profile linking
UserID and Name, when compared using the Jaro-Winkler
metric, were the most discriminative features

Conclusions & Results
For the best set of features and similarity metrics we achieved
accuracy, precision and recall of 98%, 99% and 96%,
respectively
Evaluated the system’s performance for discovering a
user’s profile on Twitter given their display name on LinkedIn
Using all features instead of only the most discriminative ones
gave better results
The system may be used to match user profiles automatically
with 64% accuracy or in a semi-supervised manner to narrow
down candidate profiles

Future Work

Incorporate more profile fields
Generalize our model to include other social networks
Adapt our system to handle missing and incorrect profile
attributes

For any further information, please write to
pk@iiitd.ac.in
precog.iiitd.edu.in

Bibliography I
[1] Carmagnola, F., and Cena, F.
    User identification for cross-system personalisation.
    Inf. Sci. 179, 1-2 (Jan. 2009), 16–32.
[2] Carmagnola, F., Osborne, F., and Torre, I.
    Cross-systems identification of users in the social web.
    In 8th IADIS Int. Conf. WWW/INTERNET, Rome, Italy (2009), pp. 129–134.
[3] Carmagnola, F., Osborne, F., and Torre, I.
    User data distributed on the social web: how to identify users on different social
    systems and collecting data about them.
    In Proceedings of the 1st International Workshop on Information Heterogeneity
    and Fusion in Recommender Systems (New York, NY, USA, 2010), HetRec ’10,
    ACM, pp. 9–15.
[4] Golbeck, J., and Rothstein, M.
    Linking social networks on the web with FOAF: a semantic web case study.
    In AAAI ’08, pp. 1138–1143.
[5] Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K.
    Identifying users across social tagging systems, 2011.

Bibliography II
[6] Irani, D., Webb, S., Li, K., and Pu, C.
    Large online social footprints–an emerging threat.
    In CSE ’09 (Aug. 2009), vol. 3, pp. 271–276.
[7] Kontaxis, G., Polakis, I., Ioannidis, S., and Markatos, E.
    Detecting social network profile cloning.
    In PerCom (March 2011), pp. 295–300.
[8] Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P.
    How unique and traceable are usernames?
    In PETS (2011), pp. 1–17.
[9] Rowe, M.
    Applying semantic social graphs to disambiguate identity references.
    In 6th Annual European Semantic Web Conference (ESWC2009) (June 2009),
    pp. 461–475.
[10] Rowe, M.
    Interlinking distributed social graphs.
    In LDOW2009 (April 2009).

Bibliography III
[11] Rowe, M., and Ciravegna, F.
    Harnessing the social web: The science of identity disambiguation.
    In Web Science Conference (2010).
[12] Szomszor, M., Alani, H., Cantador, I., O’Hara, K., and Shadbolt, N.
    Semantic modelling of user interests based on cross-folksonomy analysis.
    In ISWC ’08, pp. 632–648.
[13] Szomszor, M. N., Cantador, I., and Alani, H.
    Correlating user profiles from multiple folksonomies.
    In HT ’08, pp. 33–42.
[14] Vosecky, J., Hong, D., and Shen, V. Y.
    User identification across multiple social networks.
    In NDT ’09 (July 2009).
[15] Winkler, W. E.
    String comparator metrics and enhanced decision rules in the Fellegi-Sunter
    model of record linkage.
    In Proceedings of the Section on Survey Research (1990), pp. 354–359.

Bibliography IV

[16] Wu, Z., and Palmer, M.
    Verbs semantics and lexical selection.
    In Proceedings of the 32nd Annual Meeting on Association for Computational
    Linguistics (Stroudsburg, PA, USA, 1994), ACL ’94, Association for
    Computational Linguistics, pp. 133–138.
[17] Zafarani, R., and Liu, H.
    Connecting corresponding identities across communities, 2009.

More Related Content

PDF
05 20275 computational solution...
PPT
Visualization of Relationship between Social Bookmark Users
PPTX
Identifying features in opinion mining via intrinsic and extrinsic domain rel...
PDF
Rule based messege filtering and blacklist management for online social network
PPTX
KnowMe and ShareMe: Understanding Automatically Discovered Personality Trai...
PDF
Domain sensitive recommendation with user-item subgroup analysis
PDF
Scalable recommendation with social contextual information
PDF
Df32676679
05 20275 computational solution...
Visualization of Relationship between Social Bookmark Users
Identifying features in opinion mining via intrinsic and extrinsic domain rel...
Rule based messege filtering and blacklist management for online social network
KnowMe and ShareMe: Understanding Automatically Discovered Personality Trai...
Domain sensitive recommendation with user-item subgroup analysis
Scalable recommendation with social contextual information
Df32676679

What's hot (19)

PDF
Semantic Massage Addressing based on Social Cloud Actor's Interests
PDF
WEB-BASED ONTOLOGY EDITOR ENHANCED BY PROPERTY VALUE EXTRACTION
PDF
IRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
PDF
Aj35198205
PDF
Improvement in the usability of gis based services by
PDF
Content Based Message Filtering For OSNS Using Machine Learning Classifier
PDF
An Anonymous System using Single Sign-on Protocol
PDF
IRJET - Chat-Bot for College Information System using AI
PDF
An automatic filtering task in osn using content based approach
PDF
Towards enhanced user interaction to qualify web resources for higher-layered...
PDF
2009-Social computing-First steps to netviz nirvana
PPTX
A system to filter unwanted messages from OSN user walls
PDF
USER PROFILE BASED PERSONALIZED WEB SEARCH
PDF
IRJET- Suspicious Email Detection System
PDF
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
PDF
IJSRED-V2I2P09
PDF
Iaetsd efficient filteration of unwanted messages
PDF
MANAGING THE INTERTWINING AMONG USERS, ROLES, PERMISSIONS, AND USER RELATIONS...
PDF
Ijcatr04041017
Semantic Massage Addressing based on Social Cloud Actor's Interests
WEB-BASED ONTOLOGY EDITOR ENHANCED BY PROPERTY VALUE EXTRACTION
IRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
Aj35198205
Improvement in the usability of gis based services by
Content Based Message Filtering For OSNS Using Machine Learning Classifier
An Anonymous System using Single Sign-on Protocol
IRJET - Chat-Bot for College Information System using AI
An automatic filtering task in osn using content based approach
Towards enhanced user interaction to qualify web resources for higher-layered...
2009-Social computing-First steps to netviz nirvana
A system to filter unwanted messages from OSN user walls
USER PROFILE BASED PERSONALIZED WEB SEARCH
IRJET- Suspicious Email Detection System
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
IJSRED-V2I2P09
Iaetsd efficient filteration of unwanted messages
MANAGING THE INTERTWINING AMONG USERS, ROLES, PERMISSIONS, AND USER RELATIONS...
Ijcatr04041017
Ad

Similar to Studying user footprints in different online social networks (20)

PDF
V3 i35
PDF
Identity Resolution across Different Social Networks using Similarity Analysis
PDF
Q046049397
PPTX
Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
PDF
Open Data Infrastructures Evaluation Framework using Value Modelling
PDF
Stabilization of Black Cotton Soil with Red Mud and Formulation of Linear Reg...
PDF
kdd2015-feed (1)
PDF
TOWARDS UNIVERSAL RATING OF ONLINE MULTIMEDIA CONTENT
PDF
Recommendation System for Information Services Adapted, Over Terrestrial Digi...
PPTX
OGD new generation infrastructures evaluation based on value models
PDF
Unification Algorithm in Hefty Iterative Multi-tier Classifiers for Gigantic ...
PDF
Review and analysis of machine learning and soft computing approaches for use...
PDF
Social Friend Overlying Communities Based on Social Network Context
PDF
Asymmetric Social Proximity Based Private Matching Protocols for Online Socia...
PDF
Identifying ghost users using social media metadata - University College London
PDF
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
PPSX
Inferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
PPTX
PDF
Friend Recommendation on Social Network Site Based on Their Life Style
PDF
Scalable recommendation with social contextual information
V3 i35
Identity Resolution across Different Social Networks using Similarity Analysis
Q046049397
Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
Open Data Infrastructures Evaluation Framework using Value Modelling
Stabilization of Black Cotton Soil with Red Mud and Formulation of Linear Reg...
kdd2015-feed (1)
TOWARDS UNIVERSAL RATING OF ONLINE MULTIMEDIA CONTENT
Recommendation System for Information Services Adapted, Over Terrestrial Digi...
OGD new generation infrastructures evaluation based on value models
Unification Algorithm in Hefty Iterative Multi-tier Classifiers for Gigantic ...
Review and analysis of machine learning and soft computing approaches for use...
Social Friend Overlying Communities Based on Social Network Context
Asymmetric Social Proximity Based Private Matching Protocols for Online Socia...
Identifying ghost users using social media metadata - University College London
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
Inferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
Friend Recommendation on Social Network Site Based on Their Life Style
Scalable recommendation with social contextual information
Ad

More from IIIT Hyderabad (20)

PPTX
Responsible & Safe AI at GSFC Univ Vadodra
PDF
It is our choices, Harry, that show what we truly are, far more than our abil...
PDF
It is our choices, Harry, that show what we truly are, far more than our abil...
PPTX
Response & Safe AI at Summer School of AI at IIITH
PPTX
Responsible & Safe AI Systems at ACM India ROCS at IIT Bombay
PPTX
International Collaboration: Experiences, Challenges, Success stories
PDF
Responsible & Safe AI: #LegalBias #Inconsistency #BiasinLLMs #MultiModalBias
PDF
Identify, Inspect and Intervene Multimodal Fake News
PDF
#ChatGPT #ResponsibleAI
PDF
Data Science for Social Good: #MentalHealth #CodeMix #LegalNLP #AISafety
PDF
It is our choices, Harry, that show what we truly are, far more than our abil...
PDF
Beyond the Surface: A Computational Exploration of Linguistic Ambiguity
PPTX
Data Science for Social Good: #LegalNLP #AlgorithmicBias...
PPTX
How to Write a (Good) Research Paper
PDF
Data Science for Social Good: #LegalNLP #AlgorithmicBias
PPTX
Social Computing Research in India
PDF
Social Computing Research in India
PDF
Modeling Online User Interactions and their Offline effects on Socio-Technica...
PDF
Privacy. Winter School on “Topics in Digital Trust”. IIT Bombay
PDF
It is our choices, Harry, that show what we truly are, far more than our abil...
Responsible & Safe AI at GSFC Univ Vadodra
It is our choices, Harry, that show what we truly are, far more than our abil...
It is our choices, Harry, that show what we truly are, far more than our abil...
Response & Safe AI at Summer School of AI at IIITH
Responsible & Safe AI Systems at ACM India ROCS at IIT Bombay
International Collaboration: Experiences, Challenges, Success stories
Responsible & Safe AI: #LegalBias #Inconsistency #BiasinLLMs #MultiModalBias
Identify, Inspect and Intervene Multimodal Fake News
#ChatGPT #ResponsibleAI
Data Science for Social Good: #MentalHealth #CodeMix #LegalNLP #AISafety
It is our choices, Harry, that show what we truly are, far more than our abil...
Beyond the Surface: A Computational Exploration of Linguistic Ambiguity
Data Science for Social Good: #LegalNLP #AlgorithmicBias...
How to Write a (Good) Research Paper
Data Science for Social Good: #LegalNLP #AlgorithmicBias
Social Computing Research in India
Social Computing Research in India
Modeling Online User Interactions and their Offline effects on Socio-Technica...
Privacy. Winter School on “Topics in Digital Trust”. IIT Bombay
It is our choices, Harry, that show what we truly are, far more than our abil...

Recently uploaded (20)

PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
TLE Review Electricity (Electricity).pptx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Getting Started with Data Integration: FME Form 101
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PPTX
OMC Textile Division Presentation 2021.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PPTX
Machine Learning_overview_presentation.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPT
Teaching material agriculture food technology
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Encapsulation theory and applications.pdf
PDF
August Patch Tuesday
A comparative analysis of optical character recognition models for extracting...
Group 1 Presentation -Planning and Decision Making .pptx
Spectral efficient network and resource selection model in 5G networks
TLE Review Electricity (Electricity).pptx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Getting Started with Data Integration: FME Form 101
Univ-Connecticut-ChatGPT-Presentaion.pdf
OMC Textile Division Presentation 2021.pptx
Unlocking AI with Model Context Protocol (MCP)
Network Security Unit 5.pdf for BCA BBA.
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
SOPHOS-XG Firewall Administrator PPT.pptx
Machine Learning_overview_presentation.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Teaching material agriculture food technology
Encapsulation_ Review paper, used for researhc scholars
Encapsulation theory and applications.pdf
August Patch Tuesday

Studying user footprints in different online social networks

  • 1. Studying User Footprints in Different Online Social Networks Studying User Footprints in Different Online Social Networks Anshu Malhotra 1 , Luam Totti2 , Wagner Meira Jr.2 , Ponnurangam Kumaraguru 1 , Virg´ Almeida2 ılio 1 Indraprastha Institute of Information Technology New Delhi, India 2 Universidade Federal de Minas Gerais Belo Horizonte, Brazil August, 2012
  • 2. Studying User Footprints in Different Online Social Networks Online Digital Footprints Users commonly register and access accounts (some times several) on multiple and diverse online services like Facebook, LinkedIn, Twitter and Youtube The set of all information related to the user, either provided directly or observed from the user’s interaction, is often called the user’s online digital footprint [6]
  • 3. Studying User Footprints in Different Online Social Networks Linking User’s Online Accounts To create a user’s digital footprint the user’s multiple accounts must be known. We call this process linking user’s online accounts.
  • 4. Studying User Footprints in Different Online Social Networks Linking User’s Online Accounts Linking user accounts from different services can serve several purposes [1, 4, 10, 11, 12, 13]: Centralize user information, enforcing data consistency and simplifying account maintenance Enrich recommendation systems Cross-system personalization Enable cross-system characterization and pattern analysis Assess and possibly prevent unwanted information leakage, thereby protecting users from various privacy and security threats
  • 5. Studying User Footprints in Different Online Social Networks Main Challenges Users may choose different (and unrelated) usernames on different services, which may be unrelated to their real names [5] People with common names tend to have similar usernames [8, 17] Users may enter inconsistent and misleading information across their profiles [5], unintentionally or often deliberately in order to preserve privacy Heterogeneity in the network structure and profile fields among the services
  • 6. Existing Techniques: Various techniques have been proposed for unifying or disambiguating users' profiles across different online services: techniques based on the FOAF ontology and graphs [4, 9, 10, 11]; techniques based on user-generated tags [5, 12, 13]; techniques based on usernames [8, 17]; and techniques based on user profile attributes [1, 2, 3, 6, 7, 14].
  • 7. Limitations of Existing Techniques: Specificity to certain types of social networks. Dependency on identifiers such as email IDs and instant messenger IDs, which might not be publicly available. Use of simple text matching algorithms for comparing complex profile fields. Manual and experimental assignment of weights and thresholds, which can be subjective and does not scale.
  • 8. Limitations of Existing Techniques (continued): Use of small datasets (the largest having 5,000 users), with data collection and evaluation done manually in some approaches. Most techniques have not been evaluated in real-world settings. Several are computationally expensive.
  • 9. Major Contributions of our Work: A scalable supervised learning approach for linking users' accounts from different services. Evaluation of different context-specific similarity metrics for comparing different profile fields. Results on a large dataset for linking user accounts across Twitter and LinkedIn. Evaluation of the system's performance for discovering new accounts for a given user.
  • 10. User Profile Disambiguation - System Architecture: The Account Correlation Extractor collates the dataset of user profiles known to belong to the same user across different social networks. The Profile Crawler crawls the public profile information from user accounts on these services. A user's online digital footprint is generated after Feature Extraction and Selection. Various Classifiers are trained on account pairs belonging to the same user and on pairs belonging to different users; they are then used to disambiguate user profiles, i.e. to classify a given pair of input profiles as belonging to the same user or not.
  • 11. User Profile Disambiguation - System Architecture [Figure: system architecture diagram]
  • 12. Dataset Collection - 1st Stage: The training and testing dataset was collated from two types of sources. Social Aggregators: services that allow users to specify their multiple accounts in order to create a unified feed. We crawled 883,668 users from FriendFeed (1) and 38,755 users from Profilactic (2). Social Graph (3): an API that constructs and provides social interaction data, including information about users' accounts on multiple services. Of the 14 million users collected, only 3.9 million had useful information. (1) http://friendfeed.com/ (2) http://www.profilactic.com/ (3) http://code.google.com/apis/socialgraph/docs/
  • 13. Dataset Collection - Example (service,username entries): twitter,justinbieber youtube,kidrauhl youtube,justinbieber twitter,aplusk facebook,ashton youtube,felipeneto youtube,felipenetovlog youtube,maspoxavida twitter,pecesiqueira youtube,jp youtube,MysteryGuitarMan youtube,descealetra twitter,cauemoura twitter,jcillpam11six facebook,jeremiahcillpam11six twitter,cjdances facebook,cjperrydances twitter,sirhilton facebook,richardsrueda
  • 14. Dataset Collection - 2nd Stage: Publicly available information on each account was then collected from each service. Four services were initially chosen for the analysis: Twitter, LinkedIn, YouTube and Flickr. Six profile fields were chosen for the analysis: UserID, display name, description, location, connections and image.
  • 15. Dataset Collection: Due to the high percentage of missing fields, YouTube and Flickr were excluded. All further analysis in this work refers only to Twitter and LinkedIn accounts. [Figure: percentage of missing values (Missing (%), axis from 0 to 80) for the Name, Location, Description and Image fields on Twitter, LinkedIn, YouTube and Flickr.]
  • 16. Profile Similarity: A profile may be seen as an N-dimensional vector in which each component is a profile field [14]. Comparing profiles can therefore be done component-wise. However, the components are very different in nature and hence demand different similarity methods for comparison. In this work we evaluated different approaches for comparing each profile field.
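The component-wise comparison described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary-based profile representation, the `exact_match` placeholder metric and the choice to emit `None` for missing fields are all assumptions made for the example.

```python
def similarity_vector(profile_a, profile_b, metrics):
    """Compare two profiles field by field.

    `metrics` maps a field name to a function taking the two field values.
    A field missing from either profile yields None (an assumption made
    here; the slides do not say how missing fields are encoded).
    """
    vector = []
    for field, metric in metrics.items():
        a, b = profile_a.get(field), profile_b.get(field)
        vector.append(None if a is None or b is None else metric(a, b))
    return vector


def exact_match(a, b):
    """Placeholder metric; the slides use field-specific metrics instead."""
    return 1.0 if a == b else 0.0


# Hypothetical profiles, for illustration only.
twitter = {"userid": "jdoe", "location": "Delhi"}
linkedin = {"userid": "jdoe", "location": "New Delhi"}
metrics = {"userid": exact_match, "location": exact_match}
vec = similarity_vector(twitter, linkedin, metrics)
print(vec)
```

Passing the metric functions in as a mapping keeps each field's comparison method independent, which mirrors the point that different fields need different similarity methods.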
  • 17. Similarity Metrics: UserID and Display Name: the Jaro-Winkler [15] distance (JW) is best suited for measuring similarity between short strings, hence it was used for both of these fields. Description (desc): punctuation and stop words were removed from the fields; the words were then lemmatized and converted to lower case to produce the final token set. Three metrics were evaluated on these token sets. TF-IDF: cosine similarity between the two token sets using their tf-idf vector space representation. Jaccard (Jacc): Jaccard's similarity score between the two token sets. Ontology (Ont): Wu-Palmer [16] similarity distance between the WordNet-based ontologies of each description field.
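Two of the metrics named on this slide can be sketched in plain Python. The Jaro-Winkler function follows the standard definition (prefix scale 0.1, prefix capped at four characters); the token preparation is deliberately simplified and is an assumption: it uses a tiny illustrative stop-word list and skips the lemmatization step described on the slide.

```python
import string


def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "at"}


def description_tokens(text):
    """Lower-case, strip punctuation, drop stop words (no lemmatization)."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return {word for word in cleaned.split() if word not in STOP_WORDS}


def jaccard(tokens_a, tokens_b):
    """Size of the intersection over size of the union of two token sets."""
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

For example, `jaro_winkler("MARTHA", "MARHTA")` gives roughly 0.961, the value commonly quoted for this textbook pair.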
  • 18. Similarity Metrics: Location (Loc): tokens were extracted from the location fields of both profiles by removing punctuation and converting them to lower case. Four metrics were evaluated. Sub-string (Substr): normalized count of the tokens from one field value that are present as a substring of the other field value. Geo-distance (Geo): Euclidean distance between the two locations using their latitudes and longitudes (obtained from the Google Maps Geocoding API (4)). Jaro-Winkler distance. Jaccard's score. (4) https://developers.google.com/maps
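The sub-string and geo-distance metrics for locations might look like the sketch below. Several details are assumptions: punctuation is replaced by spaces before tokenizing, the sub-string score is normalized by the number of tokens in the first field (the slide does not say which count is used), and coordinates are taken as already geocoded `(latitude, longitude)` pairs, so no geocoding call is made.

```python
import math
import string

# Replace every punctuation character with a space before splitting.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def location_tokens(value):
    """Lower-cased tokens of a location field with punctuation removed."""
    return value.translate(_PUNCT_TO_SPACE).lower().split()


def substring_score(field_a, field_b):
    """Fraction of tokens of field_a that occur as a substring of field_b."""
    tokens = location_tokens(field_a)
    if not tokens:
        return 0.0
    haystack = " ".join(location_tokens(field_b))
    hits = sum(1 for token in tokens if token in haystack)
    return hits / len(tokens)


def geo_distance(coords_a, coords_b):
    """Euclidean distance between two (latitude, longitude) pairs, in degrees."""
    return math.hypot(coords_a[0] - coords_b[0], coords_a[1] - coords_b[1])
```

Note that a Euclidean distance on raw latitude and longitude is measured in degrees, not kilometres; it is used here only because the slide names it.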
  • 19. Similarity Metrics: Profile Image (img): the profile image was downloaded and stored locally. Each image was then scaled down to 48 x 48 pixels using cubic spline interpolation and converted to grayscale. Each image could then be represented as a vector of values from 0 to 255, to which functions computing Mean Square Error (mse), Peak Signal-to-Noise Ratio (psnr) and Levenshtein distance (ls) were applied to quantify profile image similarity.
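Once two images are available as equal-length grayscale vectors, the MSE and PSNR computations are short. The sketch below assumes the download, rescaling to 48 x 48 and grayscale conversion have already been done elsewhere, and it uses the usual PSNR definition with a peak value of 255; the function names are illustrative.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two equal-length grayscale vectors."""
    if len(img_a) != len(img_b) or not img_a:
        raise ValueError("images must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


def psnr(img_a, img_b, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(img_a, img_b)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)
```

A lower MSE and a higher PSNR both indicate more similar images, so the two metrics carry closely related information.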
  • 20. Similarity Metrics: Number of Connections (conn): for Twitter, the number of connections of a user u is the number of users that u follows; for LinkedIn, it is the number of users in the private network of user u. The number of connections in different services can span different ranges, with different meanings. Normalized (norm): each connection value c was normalized to the range [0..1] using the smallest and greatest connection values observed in each service. The similarity was then taken as the unsigned difference between the two values. Class: each value was assigned an (equally sized) class denoting how big it was. Five classes were used in this work (0-4). The similarity was taken as the difference between the two classes.
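The two connection metrics can be illustrated with a few lines of code. The min-max normalization follows the slide; interpreting the five "equally sized" classes as equal-width bins over the normalized range is an assumption, as are the example counts and per-service ranges.

```python
def normalize(count, smallest, greatest):
    """Min-max normalize a connection count to [0, 1] within one service."""
    if greatest == smallest:
        return 0.0
    return (count - smallest) / (greatest - smallest)


def size_class(normalized, n_classes=5):
    """Map a normalized value to a class 0..n_classes-1 (equal-width bins)."""
    return min(int(normalized * n_classes), n_classes - 1)


# Hypothetical counts and per-service ranges, for illustration only.
twitter_norm = normalize(500, 0, 2000)
linkedin_norm = normalize(125, 0, 500)

norm_similarity = abs(twitter_norm - linkedin_norm)
class_similarity = abs(size_class(twitter_norm) - size_class(linkedin_norm))
print(norm_similarity, class_similarity)
```

With both metrics, a value of 0 means the two accounts sit at the same relative position within their own service.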
  • 21. Evaluation Experiments: Feature Analysis: to analyze the discriminative capacity of different profile attributes and similarity metrics for disambiguating user profiles. Matching Profiles: to test the effectiveness of supervised learning approaches for classifying two profiles as belonging to the same user. Discovering Candidate Profiles: to evaluate the performance of our framework for discovering new accounts for a given known user. The analysis was done using a dataset of account pairs of 29,129 unique users.
  • 22. Feature Analysis
             UserID   Name     Description                  Connections
             JW       JW       Jacc     TF-IDF   Ont        Norm     Class
    IG       0.548    0.812    0.286    0.323    0.161      0.000    0.009
    Relief   0.434    0.521    0.134    0.180    0.113      0.002    0.095
    MDL      0.379    0.562    0.274    0.300    0.188     -0.006    0.006
    Gini     0.151    0.217    0.084    0.092    0.051      0.000    0.003

             Location                               Image
             JW       Jacc     Substr   Geo        MSE      PSNR     LS
    IG       0.232    0.337    0.350    0.520      0.183    0.184    0.215
    Relief   0.108    0.041    0.039    0.227      0.157    0.158    0.188
    MDL      0.158    0.233    0.270    0.488      0.205    0.205    0.227
    Gini     0.067    0.098    0.102    0.146      0.051    0.051    0.061
    Table: Discriminative capacity of each <feature, metric> pair according to four different approaches.
  • 23. Feature Analysis - Box Plots [Figure: box plots comparing Match and Non Match pairs for six panels: userid jw, name jw, location jw, location jaccard, location substring and location geo.] Figure: Box plots for the UserID, Name and Location features.
  • 24. Feature Analysis - Box Plots [Figure: box plots comparing Match and Non Match pairs for six panels: description jaccard, description tf-idf, description ontology, connections class, image psnr and image levenshtein.] Figure: Box plots for the Description, Connections and Image features.
  • 25. Matching User Profiles: The most promising similarity metrics and features were used to train classifiers for the task of detecting profiles that belong to the same user. Similarity Vector: e.g. <userid_jw, desc_jaccard, loc_geo>. Training Set: Positive Examples: similarity vectors for the account pairs of the dataset. Negative Examples: an equal number of negative examples synthesized by randomly pairing accounts from different users and calculating their similarity vectors. This gave a total of 58,258 training instances.
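The construction of a balanced training set described on this slide can be sketched as follows. The function name, the fixed random seed and the representation of each user as a `(twitter_profile, linkedin_profile)` tuple are assumptions; the slide only states that negatives come from randomly pairing accounts of different users.

```python
import random


def build_training_set(linked_pairs, similarity_vector, seed=0):
    """Return (vector, label) examples: one positive and one negative per user.

    linked_pairs: list of (profile_a, profile_b) known to share an owner.
    similarity_vector: function mapping two profiles to a feature vector.
    """
    n = len(linked_pairs)
    if n < 2:
        raise ValueError("need at least two users to synthesize negatives")
    rng = random.Random(seed)
    examples = [(similarity_vector(a, b), 1) for a, b in linked_pairs]
    for i, (a, _) in enumerate(linked_pairs):
        # Pick another user's index uniformly, skipping i itself.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        examples.append((similarity_vector(a, linked_pairs[j][1]), 0))
    return examples
```

With one negative per positive, the 29,129 linked users mentioned earlier would yield 2 x 29,129 = 58,258 instances, which matches the total reported on the slide.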
  • 26. Matching User Profiles: After training, the classifiers were tested with Twitter-LinkedIn profile pairs to be classified as a "Match" or a "Non Match". A "Match" means that the two given input profiles belong to the same user, while a "Non Match" means they do not. Classifiers used: Naïve Bayes, Decision Tree, SVM and kNN.
  • 27. Matching User Profiles: Results were generated for all possible combinations of profile features and similarity metrics using 10-fold cross-validation. As shown below, we achieved accuracy, precision and recall of 98%, 99% and 96% respectively for the best feature set.
                 Naïve Bayes   Decision Tree   SVM     kNN
    Accuracy     0.980         0.965           0.972   0.898
    Precision    0.996         0.994           0.988   0.998
    Recall       0.964         0.936           0.956   0.798
    F1           0.980         0.964           0.971   0.887
    Table: Results for multiple classifiers using the feature set {name_jw, userid_jw, loc_geo, desc_tfidf, img_ls, conn_norm}.
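As a quick consistency check on the table, the F1 column can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The numbers below are copied from the slide; small differences in the third decimal are expected because the inputs are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given on the slide.
reported = {
    "Naive Bayes": (0.996, 0.964, 0.980),
    "Decision Tree": (0.994, 0.936, 0.964),
    "SVM": (0.988, 0.956, 0.971),
    "kNN": (0.998, 0.798, 0.887),
}

for name, (precision, recall, f1) in reported.items():
    print(f"{name}: recomputed {f1_score(precision, recall):.3f}, reported {f1:.3f}")
```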
  • 28. Discovering New User Profiles: The results are very good on a static dataset, but what if we do not have candidates to match against a known user profile? A system was developed for retrieving candidate profiles, from some other service, that may match a known account. One fifth of the true positive data was reserved as the testing set. A Naïve Bayes classifier was trained with the remaining set and was modified to return the probability that the two input profiles belong to the same user.
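A classifier that returns a match probability rather than a hard label could look like the sketch below. It is a small Gaussian Naïve Bayes written from scratch; the Gaussian likelihood, the variance smoothing constant and the class and method names are assumptions, since the slides do not say which Naïve Bayes variant or implementation was used.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes over numeric similarity vectors."""

    def fit(self, vectors, labels):
        by_class = defaultdict(list)
        for vector, label in zip(vectors, labels):
            by_class[label].append(vector)
        self.priors, self.stats = {}, {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(vectors)
            self.stats[label] = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Small constant avoids a zero variance (an assumption).
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-6
                self.stats[label].append((mean, var))
        return self

    def predict_proba(self, vector):
        """Return {label: posterior probability} for one similarity vector."""
        log_post = {}
        for label, prior in self.priors.items():
            total = math.log(prior)
            for value, (mean, var) in zip(vector, self.stats[label]):
                total += -0.5 * math.log(2 * math.pi * var)
                total -= (value - mean) ** 2 / (2 * var)
            log_post[label] = total
        # Normalize in log space for numerical stability.
        peak = max(log_post.values())
        weights = {label: math.exp(lp - peak) for label, lp in log_post.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}
```

Calling `predict_proba(vector)[1]` then gives the estimated probability that a pair is a match, which is the quantity needed for ranking candidates.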
  • 29. Discovering Candidate User Profiles: For each profile pair in the testing dataset, we query Twitter's API using the LinkedIn display name. For each profile returned by the Twitter API, we compute the similarity vector with the user's LinkedIn profile. We then use the trained classifier to return the probability that each of these profiles belongs to the same user. We rank the Twitter profiles in decreasing order of probability. Ideally, the correct Twitter profile (from the profile pair in the testing set) should be at the top of this ranking.
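The ranking step above can be sketched independently of the search API and the classifier by passing both in as functions. Everything named here is hypothetical: the candidate list stands in for the profiles returned by a name search, and the toy similarity and probability functions exist only to make the example run.

```python
def rank_candidates(known_profile, candidates, similarity_vector, match_probability):
    """Return (probability, candidate) pairs sorted from most to least likely."""
    scored = [
        (match_probability(similarity_vector(known_profile, candidate)), candidate)
        for candidate in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy stand-ins for the real similarity metrics and trained classifier.
def toy_similarity(a, b):
    return [1.0 if a["userid"] == b["userid"] else 0.0]


def toy_probability(vector):
    return 0.05 + 0.9 * vector[0]


linkedin_profile = {"userid": "jdoe", "name": "John Doe"}
search_results = [
    {"userid": "johnd", "name": "John Doe"},
    {"userid": "jdoe", "name": "John Doe"},
]
ranked = rank_candidates(linkedin_profile, search_results, toy_similarity, toy_probability)
print([candidate["userid"] for _, candidate in ranked])
```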
  • 30. Discovering Candidate User Profiles [Figure: Profiles Found (%) versus rank position (up to 20) for two series, All Features and Best Features.] Figure: Relation between the rank position r and the percentage of times the right profile is found at a position lower than or equal to r.
  • 31. Discovering Candidate User Profiles: In 64% of the cases the right profile was found in the first position of the ranking when using all features, while this value was 49% for the set of the best features. This means that using all features, instead of only the best ones, can help to disambiguate between the possible candidates. In 75% of the cases the right profile was in the first 3 positions of the ranking. This suggests the system can be used in a semi-supervised manner.
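The percentages quoted here correspond to a cumulative "found within rank r" measure, which can be computed as in the sketch below. The list of ranks is made-up toy data, not the authors' results, and using `None` for a correct profile that was never retrieved is an assumption.

```python
def found_within(ranks, r):
    """Percentage of test cases whose correct profile is ranked at position <= r.

    ranks: 1-based rank of the correct profile per test case,
    or None when the correct profile was not retrieved at all.
    """
    hits = sum(1 for rank in ranks if rank is not None and rank <= r)
    return 100.0 * hits / len(ranks)


# Toy data for illustration only.
ranks = [1, 1, 3, None, 2, 7, 1, None, 1, 4]
print(found_within(ranks, 1), found_within(ranks, 3))
```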
  • 32. Conclusions & Results: We applied automated techniques to identify accounts belonging to the same user in different online services. Only publicly available information was extracted and used. We proposed and evaluated multiple similarity metrics, comparing their discriminative capacity for the task of profile linking. UserID and Name, when compared using the Jaro-Winkler metric, were the most discriminative features.
  • 33. Conclusions & Results: For the best set of features and similarity metrics we achieved accuracy, precision and recall of 98%, 99% and 96% respectively. We evaluated the system's performance for discovering a user's profile on Twitter given the user's display name on LinkedIn. Using all features instead of only the most discriminative ones gave better results. The system may be used to match user profiles automatically with 64% accuracy, or in a semi-supervised manner to narrow down candidate profiles.
  • 34. Future Work: Incorporate more profile fields. Generalize our model to include other social networks. Adapt our system to handle missing and incorrect profile attributes.
  • 35. For any further information, please write to pk@iiitd.ac.in. precog.iiitd.edu.in
  • 36. Bibliography I: [1] Carmagnola, F., and Cena, F. User identification for cross-system personalisation. Inf. Sci. 179, 1-2 (Jan. 2009), 16–32. [2] Carmagnola, F., Osborne, F., and Torre, I. Cross-systems identification of users in the social web. In 8th IADIS Int. Conf. WWW/INTERNET, Rome, Italy (2009), pp. 129–134. [3] Carmagnola, F., Osborne, F., and Torre, I. User data distributed on the social web: how to identify users on different social systems and collecting data about them. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems (New York, NY, USA, 2010), HetRec '10, ACM, pp. 9–15. [4] Golbeck, J., and Rothstein, M. Linking social networks on the web with FOAF: a semantic web case study. AAAI '08, pp. 1138–1143. [5] Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K. Identifying users across social tagging systems, 2011.
  • 37. Bibliography II: [6] Irani, D., Webb, S., Li, K., and Pu, C. Large online social footprints–an emerging threat. In CSE '09 (Aug. 2009), vol. 3, pp. 271–276. [7] Kontaxis, G., Polakis, I., Ioannidis, S., and Markatos, E. Detecting social network profile cloning. In PerCom (March 2011), pp. 295–300. [8] Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P. How unique and traceable are usernames? In PETS (2011), pp. 1–17. [9] Rowe, M. Applying semantic social graphs to disambiguate identity references. In 6th Annual European Semantic Web Conference (ESWC2009) (June 2009), pp. 461–475. [10] Rowe, M. Interlinking distributed social graphs. In LDOW2009 (April, Spring 2009).
  • 38. Bibliography III: [11] Rowe, M., and Ciravegna, F. Harnessing the social web: The science of identity disambiguation. In Web Science Conference (2010). [12] Szomszor, M., Alani, H., Cantador, I., O'Hara, K., and Shadbolt, N. Semantic modelling of user interests based on cross-folksonomy analysis. ISWC '08, pp. 632–648. [13] Szomszor, M. N., Cantador, I., and Alani, H. Correlating user profiles from multiple folksonomies. HT '08, pp. 33–42. [14] Vosecky, J., Hong, D., and Shen, V. Y. User identification across multiple social networks. In NDT '09 (July 2009). [15] Winkler, W. E. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research (1990), pp. 354–359.
  • 39. Bibliography IV: [16] Wu, Z., and Palmer, M. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 1994), ACL '94, Association for Computational Linguistics, pp. 133–138. [17] Zafarani, R., and Liu, H. Connecting corresponding identities across communities, 2009.