See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/280782699
Presentation - Fake Twitter accounts: Profile characteristics obtained using an activity-based pattern detection approach
Dataset · August 2015
4 authors, including: Supraja Gurajala; Joshua S White (State University of New York Institute of Technology at …)
Available from: Joshua S White
Retrieved on: 11 April 2016
 
	
  
Fake Twitter accounts: Profile characteristics obtained using an activity-based pattern detection approach
Supraja Gurajala, Joshua S. White, Brian Hudson, and Jeanna N. Matthews
Department of Mathematics and Computer Science
Clarkson University, Potsdam, NY
gurajas@clarkson.edu
  
 
	
  
Online Social Networks
•  OSNs are widely used for:
  – Personal communications, social interactions
  – Commercial uses
    •  Marketing through OSNs
    •  Brand growth
  – Big Data
    •  E.g. Election data analysis
    •  Flu tracking
  – Our focus - Twitter
  
 
	
  
Mining Twitter Data
•  New and novel applications
  – Public health
  – Personal/Organizational popularity
  – Scientific dissemination
•  Necessitates confidence in user authenticity
  – Challenged by cyber-opportunists
•  Sybils – fake profiles created to mimic users
  – Some general, some are identity clone accounts
  
 
	
  
Fake Accounts - problem
•  “Followers for hire”
  – Inflating follower numbers for other accounts
•  Fake followers
  – Bias an individual/organization’s popularity
  – Alter characteristics of the audience
  – Create a legitimacy problem
•  A multi-million dollar market for buying fake followers
  
 
	
  
Proposed approach
•  Most of the current Twitter-based research has focused on analysis of tweets
•  Twitter profile-based analysis
  – Faster detection
  – Simpler logistics
    •  smaller storage, memory required
  – More comprehensive detection
    •  Can also detect accounts that don’t tweet
  
 
	
  
Implementation
•  Twitter user profiles
  – Created accounts to access profiles
  – Twitter API
  – Breadth first search over a set of seed users
  – Currently collected 62 million user profiles
  – JSON (JavaScript Object Notation) format
    •  MongoDB
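The breadth first search over seed users can be sketched as follows. This is a minimal illustration and not the authors' crawler: `fetch_profile` and `fetch_neighbors` are hypothetical callables standing in for whatever API client is used, and `max_profiles` is an assumed stopping condition.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_profiles(
    seeds: Iterable[str],
    fetch_profile: Callable[[str], dict],
    fetch_neighbors: Callable[[str], List[str]],
    max_profiles: int,
) -> Dict[str, dict]:
    """Breadth-first crawl of user profiles starting from seed users.

    fetch_profile and fetch_neighbors are hypothetical placeholders,
    not real Twitter API functions.
    """
    seen = set()
    queue = deque()
    for seed in seeds:  # de-duplicate the seeds while keeping their order
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    profiles: Dict[str, dict] = {}
    while queue and len(profiles) < max_profiles:
        user = queue.popleft()
        profiles[user] = fetch_profile(user)
        for neighbor in fetch_neighbors(user):
            if neighbor not in seen:  # enqueue each user at most once
                seen.add(neighbor)
                queue.append(neighbor)
    return profiles
```

Because the queue is first-in, first-out, all users one hop from the seeds are visited before any user two hops away.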
  
 
	
  
Methodology
•  33 attributes for each profile
•  Manual inspection
  – Most of these attributes are set to default values
•  10 Key attributes
  – name, followers_count, friends_count, verified, created_at, description, location, updated, profile_image_url and screen_name
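Reducing each stored profile to the ten key attributes could look like the sketch below. The attribute names are taken from the slide; the function name `project_profile` and the choice to fill missing attributes with `None` are assumptions made for illustration.

```python
import json

# The ten key attributes listed on the slide.
KEY_ATTRIBUTES = (
    "name", "followers_count", "friends_count", "verified", "created_at",
    "description", "location", "updated", "profile_image_url", "screen_name",
)


def project_profile(profile: dict) -> dict:
    """Keep only the key attributes; absent ones become None (an assumption)."""
    return {key: profile.get(key) for key in KEY_ATTRIBUTES}


# Example with a made-up profile record in JSON form.
raw = '{"name": "FREE FOLLOW", "screen_name": "freefollow1", "lang": "en"}'
slim = project_profile(json.loads(raw))
```

Dropping the remaining attributes keeps the per-profile record small, in line with the storage argument made for profile-based analysis.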
  
 
	
  
Methodology
Raw User Profiles – 62 Million
•  Group by Name
  – 21.2 Million accounts
  – 3.8 Million groups
  – e.g. Group name: FREE FOLLOW
    •  Number of groups: 1
    •  Number of accounts: 1833
•  Group by Name + Description + Location
  – 6.9 Million accounts
  – 0.7 Million groups
  – Number of groups: 78
  – Number of accounts: 804
•  Screen-name Patterns
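The two grouping stages (by name, then by name, description, and location) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the helper `group_profiles` is hypothetical, and keeping only groups with at least two accounts (`min_size=2`) is an assumption the slides do not state.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def group_profiles(
    profiles: Iterable[dict],
    keys: Sequence[str],
    min_size: int = 2,  # assumed: a group needs at least two accounts
) -> Dict[Tuple, List[dict]]:
    """Group profiles whose values for all of the given keys are identical."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for profile in profiles:
        groups[tuple(profile.get(k) for k in keys)].append(profile)
    return {k: members for k, members in groups.items() if len(members) >= min_size}


# Made-up sample data for illustration only.
sample = [
    {"name": "FREE FOLLOW", "description": "a", "location": "x", "screen_name": "ff1"},
    {"name": "FREE FOLLOW", "description": "a", "location": "x", "screen_name": "ff2"},
    {"name": "FREE FOLLOW", "description": "b", "location": "y", "screen_name": "ff3"},
    {"name": "Unique Person", "description": "", "location": "", "screen_name": "up"},
]
by_name = group_profiles(sample, ["name"])
by_all = group_profiles(by_name[("FREE FOLLOW",)], ["name", "description", "location"])
```

The second stage is applied inside each name group, so one name group can split into several smaller groups or lose members that share nothing but the name.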
 
	
  
Pattern Recognition
SE – Shannon Entropy
•  Group: take Name1 as the base Screen_Name and calculate SE1
•  Select Screen-Name for analysis: Name2
•  Concatenate screen-names: Name1Name2, and calculate SE2
•  Is SE2 - SE1 < 0.1?
  – Yes: Add Name2 to current collection
  – No: Add to group for analysis in next iteration
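The entropy comparison in the flowchart can be sketched as follows. The 0.1 threshold and the concatenation step come from the slides; computing the entropy per character with a base-2 logarithm, and the function names `shannon_entropy` and `collect_patterns`, are assumptions made for this illustration.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (base 2 is an assumption)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def collect_patterns(screen_names: List[str], threshold: float = 0.1) -> List[List[str]]:
    """Split the screen names of one group into collections of similar names."""
    remaining = list(screen_names)
    collections: List[List[str]] = []
    while remaining:
        base, rest = remaining[0], remaining[1:]  # Name1 is the base screen name
        se1 = shannon_entropy(base)
        current, deferred = [base], []
        for name in rest:  # each candidate plays the role of Name2
            se2 = shannon_entropy(base + name)
            if se2 - se1 < threshold:
                current.append(name)   # Yes: add Name2 to the current collection
            else:
                deferred.append(name)  # No: analyse again in the next iteration
        collections.append(current)
        remaining = deferred
    return collections
```

For example, `collect_patterns(["freefollow1", "freefollow2", "xyzqkw"])` places the two `freefollow` names in one collection, because appending a near-identical name adds few new characters and barely changes the entropy, while `xyzqkw` is deferred and ends up in a collection of its own.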
  
 
	
  
Methodology
Raw User Profiles – 62 Million
•  Group by Name
  – 21.2 Million accounts
  – 3.8 Million groups
  – e.g. Group name: FREE FOLLOW
    •  Number of groups: 1
    •  Number of accounts: 1833
•  Group by Name + Description + Location
  – 6.9 Million accounts
  – 0.7 Million groups
  – Number of groups: 78
  – Number of accounts: 804
•  Screen-name Patterns
  – 8276 patterns
  – Accounts: 263,250
  – e.g. Patterns in a collection list: freefollow, SavedCompete, etc.
 
	
  
Elimination of false positives
•  Identifies closely associated accounts
•  False positives – Highly popular names
•  Final filter
  – Determine distribution of update times
    •  Genuine accounts – uniform distribution
    •  Fake - non uniform
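One way to quantify how far a group's update times are from a uniform distribution is a chi-square statistic over hour-of-day bins, sketched below. The slides do not say which measure was used, so the chi-square statistic, the 24 hourly bins, and the function name `chi_square_uniform` are all assumptions.

```python
from collections import Counter
from typing import Iterable


def chi_square_uniform(hours: Iterable[int], bins: int = 24) -> float:
    """Chi-square statistic of hour-of-day counts against a uniform distribution.

    Assumes every hour is an integer in range(bins). A value of 0 means the
    counts are perfectly uniform; larger values mean stronger clustering.
    """
    counts = Counter(hours)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no update times given")
    expected = total / bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(bins))
```

A group whose accounts were all updated in the same hour yields a large statistic, while update times spread evenly over the day yield a value near zero. Any cut-off between the two would have to be chosen empirically.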
  
 
	
  
Figure: Update time occurrences (panels: Chopra, free_follow). Distribution of update times as a function of time of day.
  
 
	
  
Methodology
Raw User Profiles – 62 Million
•  Group by Name
  – 21.2 Million accounts
  – 3.8 Million groups
  – e.g. Group name: FREE FOLLOW
    •  Number of groups: 1
    •  Number of accounts: 1833
•  Group by Name + Description + Location
  – 6.9 Million accounts
  – 0.7 Million groups
  – Number of groups: 78
  – Number of accounts: 804
•  Screen-name Patterns
  – 8276 patterns
  – Accounts: 263,250
  – e.g. Patterns in a collection list: freefollow, SavedCompete, etc.
•  Update times
  – 8873 groups
  – Accounts: ~56,000
•  Core group of fake accounts: Same Name, Description, Location, and Update times and Screen names matching 1 or more patterns
 
	
  
Analysis of Results
•  Generation of Ground Truth data
  – Random sample of Twitter user profiles
  – Similar size and timeline for consistency
•  Fake Profile characteristics
  – Update Times
  – Creation Times
  – URL Analysis
  
 
	
  
Update times
Figure panels: Ground truth, Fake profiles

Update days
Figure panels: Ground truth, Fake profiles

Creation times
Figure panels: Ground truth, Fake profiles
No bias in the creation times of ground truth data, unlike the fake profiles

Creation days
Figure panels: Ground truth, Fake profiles
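Histograms like the ones compared on these slides (by time of day and by day of week) can be built from raw timestamps as sketched below. The timestamp format is an assumption: it follows the `created_at` style of the Twitter REST API v1.1 (for example `Mon Nov 29 21:18:15 +0000 2010`), and the function name `time_histograms` is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

# Assumed timestamp layout, e.g. "Mon Nov 29 21:18:15 +0000 2010".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def time_histograms(timestamps: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count timestamps by hour of day (0-23) and by weekday (0 = Monday)."""
    by_hour: Counter = Counter()
    by_weekday: Counter = Counter()
    for raw in timestamps:
        moment = datetime.strptime(raw, TIMESTAMP_FORMAT)
        by_hour[moment.hour] += 1
        by_weekday[moment.weekday()] += 1
    return by_hour, by_weekday
```

Running the same function on the ground truth sample and on the fake profile set gives two pairs of histograms that can be compared bin by bin.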
  
 
	
  
Creation time distribution
•  Fake accounts are mostly created in batches, over short time intervals
•  Note: these results are for ~60000 accounts
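Batches of accounts created over short time intervals can be found by sorting creation times and splitting wherever the gap between neighbours is large, as in the sketch below. The parameters `max_gap` and `min_size` are hypothetical tuning values; the slides do not give a definition of a batch.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def creation_batches(
    times: Sequence[datetime],
    max_gap: timedelta,   # assumed: largest gap allowed inside one batch
    min_size: int = 2,    # assumed: smallest run that counts as a batch
) -> List[List[datetime]]:
    """Split creation times into runs whose consecutive gaps are at most max_gap."""
    batches: List[List[datetime]] = []
    current: List[datetime] = []
    for moment in sorted(times):
        if current and moment - current[-1] > max_gap:
            if len(current) >= min_size:
                batches.append(current)
            current = []  # the gap is too large, so a new run starts here
        current.append(moment)
    if len(current) >= min_size:
        batches.append(current)
    return batches
```

With a `max_gap` of one minute, for instance, three accounts created ten seconds apart form one batch, while an account created days later on its own is ignored.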
  
 
	
  
  
 
	
  
Figure: Collage of images from distinct URLs for a group of 659 accounts within a fake profile set
  
 
	
  
Conclusions
•  A large Twitter profile database was analyzed
  – Fake accounts were detected based on their profile characteristics
•  Fake accounts tend to have minimal diversity in:
  – creation times
  – update times
  – image URLs
•  The creation of fake accounts is possibly only semi-automated
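"Minimal diversity" in an attribute can be expressed as the share of distinct values within a group, as in the sketch below. This ratio and the name `diversity_ratio` are assumptions chosen for illustration; the slides do not define a diversity measure.

```python
from typing import Hashable, Iterable


def diversity_ratio(values: Iterable[Hashable]) -> float:
    """Fraction of distinct values: 1.0 means all differ, near 0 means few distinct."""
    items = list(values)
    if not items:
        raise ValueError("no values given")
    return len(set(items)) / len(items)
```

Applied to the `profile_image_url` values of one group, a low ratio indicates that many accounts reuse the same few images. The same function works for creation or update times once they are rounded to a chosen resolution.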
  
 
	
  
Future Work
•  Using this core fake data as seed will help us identify a more comprehensive fake profile set
•  Pattern-matching focused algorithm
•  Understand the intra- and inter-relationships of these fake accounts
•  Tweet Analysis - Apache Spark - GraphX
  
 
	
  
	
  	
  	
  	
  	
  	
QUESTIONS?
  

