Brave New Task: User Account Matching

Pisa – October 5, 2012


                                          C. Hauff, G. Friedland
                                                   TU Delft & ICSI




                         MediaEval 2012                              1
Users on the Social Web




     “Cooperative” users: publicly
   provide their respective accounts.

Not just one account, but many accounts.
User Account Matching

[Figure: two streams, each labeled "Social Web Stream", drawn along a time axis]

Can we identify the same user in another social Web stream universe?
User Account Matching
• Different scenarios

[Figure: matching scenarios labeled "1 vs. 1", "1 vs. k", and "additional evidence"]

Our task setup: 1 vs. 1.
Why?

• Benevolent uses
  • Enriched user models
  • Improved personalization effectiveness
  • To make users happier

• Malicious uses
  • Password recovery (self-service password reset) on a large scale
  • Discover “offline” information based on enriched profiles (e.g. phone numbers)

Example Recovery Questions
  What is your favorite team?
  What is your favorite movie?
  What is your favorite TV program?
  What is your least favorite nickname?
  What is your favorite sport?
  Who was your childhood hero?
  What was the first concert you attended?
  What time of the day were you born?
  What was your dream job as a child?
  What is the middle name of your oldest child?

  Source: http://goodsecurityquestions.com
Existing work with a strong text-based bias
• Most previous work based directly on the profile
  information




Existing work profile-information based
• Zafarani et al., 2009
   • Similarity of user names on different
     platforms
   • Automatic matching ground truth:
     BlogCatalog (cooperative users)


• Abel et al., 2010
   • Investigated the amount of user profile aggregation possible with
     cross-community linking (cross-links retrieved from the Social Graph API)

                Source: Abel et al., Interweaving Public User Profiles on the Web, UMAP 2010

Existing work beyond profiles
• Narayanan et al., 2009
   • Rely on the graph structure of social networks to de-anonymize the
     graph (no use of profile or content information)

• Iofciu et al., 2011
   • Used tags (StumbleUpon, Flickr, Delicious) of images and
     bookmarks to identify matching accounts
   • Ground truth based on the Social Graph API
   • Content-based matching (compared to user name matching) is a
     much more difficult task
      • Starting point for our work




Our task (1 vs. 1)


Assuming a set of uncooperative users, i.e. users that cannot be linked
according to their self-reported profile information, to what extent is it
still possible to determine matches?

Given a Flickr account … determine the corresponding Twitter account from a
large set of potential streams.
Data Set: The Basics

Profile meta-data is removed.

• 50,000 semi-random users selected on Twitter and
  followed for three months (04/2012-06/2012)
  • ~18,000 tweeted at least once in that time period

• Manually checked potential matching Flickr accounts
  • Potential matches: (i) tweets containing flickr.com, (ii) existing
    Flickr account with the same user name
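
The two candidate heuristics above could be sketched roughly as follows. This is only an illustration, not the authors' tooling: the function name, the reason labels, and the case-insensitive user name comparison are assumptions, and every candidate it flags would still need the manual check described on the slide.

```python
import re

# Matches any mention of flickr.com inside a tweet (heuristic i).
FLICKR_LINK = re.compile(r"\bflickr\.com\b", re.IGNORECASE)


def candidate_reasons(twitter_name, tweets, flickr_names):
    """Return which heuristics flag this Twitter user as a potential match.

    flickr_names is assumed to be a set of lower-cased Flickr user names.
    """
    reasons = []
    # (i) at least one tweet contains flickr.com
    if any(FLICKR_LINK.search(tweet) for tweet in tweets):
        reasons.append("tweet_links_flickr")
    # (ii) a Flickr account with the same user name exists
    # (ignoring case is an assumption, not stated in the slides)
    if twitter_name.lower() in flickr_names:
        reasons.append("same_user_name")
    return reasons
```

An empty result means neither heuristic fired, so the user would not be passed on for manual checking.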




Data Set: User Distribution

[Figure: user distribution, with regions labeled "more tweets than images" and
"more images than tweets" and a marker at 200 photos (Flickr limit)]
Data Set: The Temporal Dimension

119 account pairs with overlapping time stamps

[Figure: temporal coverage of the account pairs; one region is labeled "No information available"]
Baseline
• Treat all tweets of a user as a document
   • Corpus of Twitter user documents

• Treat all textual information from a user’s Flickr stream as
  a (very long) query

• Rank the documents with respect to the query
  (i.e. rank the Twitter accounts)


This is a standard ad hoc retrieval problem: we used Okapi.
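
A minimal sketch of this retrieval setup, assuming that "Okapi" refers to BM25-style weighting (the slides do not give the exact variant or parameters). The whitespace tokenizer, the defaults k1 = 1.2 and b = 0.75, the non-negative idf variant, and the linear query term frequency weighting are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lower-cased whitespace tokenizer (assumption).
    return text.lower().split()


def bm25_rank(query_text, documents, k1=1.2, b=0.75):
    """Rank document ids (Twitter accounts) against one long query, best first.

    documents maps an account id to the concatenated tweets of that user;
    query_text is all textual information from one Flickr stream.
    """
    if not documents:
        return []
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(doc_tokens)
    avg_len = sum(len(t) for t in doc_tokens.values()) / n_docs or 1.0

    # Document frequency: in how many user documents each term occurs.
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    query_tf = Counter(tokenize(query_text))
    scores = {}
    for doc_id, tokens in doc_tokens.items():
        tf = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term, qtf in query_tf.items():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Repeated query terms count linearly (a simplification).
            score += qtf * idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `bm25_rank(flickr_text, twitter_docs)` returns the account ids ordered by score; the position of the true matching account in that list is what the evaluation then looks at.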



Baseline Results

RR = 1 / rank(matchingAccount)




  Account matching based on content is hard.


  The larger the number of Flickr images, the
              better the matching.


Baseline Results: Taking a Closer Look
• Distribution of the 233 RR values

  [Figure: bar chart with bins "0", "1", and "Other"; vertical axis from 0 to 180]

  Task is either very easy or very difficult. Less than 20% of “queries”
  with non-0/1 RR.




• Influence of time overlap on MRR

  Account pairs                          MRR
  119 account pairs with overlap         0.2134
  114 account pairs without overlap      0.1253

  Time overlap in streams makes the task easier!
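
The evaluation measure can be sketched as below. RR follows the definition 1 / rank(matchingAccount) from the slides; scoring an account that does not appear in the ranking as 0 is an assumption, and `pooled_mrr` is a hypothetical helper that only combines per-group means weighted by group size.

```python
def reciprocal_rank(ranking, matching_account):
    """RR = 1 / rank(matchingAccount), with ranks starting at 1.

    Returns 0.0 if the matching account is not in the ranking (assumption).
    """
    for position, account in enumerate(ranking, start=1):
        if account == matching_account:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rr_values):
    """MRR: the plain average of the per-query RR values."""
    if not rr_values:
        raise ValueError("need at least one RR value")
    return sum(rr_values) / len(rr_values)


def pooled_mrr(groups):
    """Combine (group_size, group_mrr) pairs into the MRR over all queries."""
    total = sum(size for size, _ in groups)
    return sum(size * mrr for size, mrr in groups) / total
```

Since 119 + 114 = 233, `pooled_mrr([(119, 0.2134), (114, 0.1253)])` gives roughly 0.170 as the MRR over all account pairs, up to the rounding of the two reported values.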

Challenges
① Social networks have (strong) data gathering
  restrictions in place – requires long-term setup
   • Twitter: complete user history not available
   • Flickr: max. 200 photos for users without “pro” accounts

② Users use different social networks at different time
  periods – makes the matching even more difficult

③ Automatic ground truth generation is error-prone:
  self-reported links can be arbitrary, link to friends, etc.
   •   Crowd-sourcing may be an option

④ Many encountered matches are not from private
  individuals, but belong to organizations or businesses
Thank you!

All suggestions are welcome!




