Supraja_SMS_presentation

Presentation - Fake Twitter accounts: Profile characteristics obtained using an activity-based pattern detection approach
DATASET · AUGUST 2015
4 AUTHORS, INCLUDING:
Supraja Gurajala
Joshua S White
State University of New York Institute of Technology at …

Fake Twitter accounts: Profile characteristics obtained using an activity-based pattern detection approach

Supraja Gurajala, Joshua S. White, Brian Hudson, and Jeanna N. Matthews

Department of Mathematics and Computer Science
Clarkson University, Potsdam, NY

gurajas@clarkson.edu

Online Social Networks

•  OSNs are widely used for:
    – Personal communications, social interactions
    – Commercial uses
        •  Marketing through OSNs
        •  Brand growth
    – Big Data
        •  E.g. Election data analysis
        •  Flu tracking
    – Our focus - Twitter

Mining Twitter Data

•  New and novel applications
    – Public health
    – Personal/Organizational popularity
    – Scientific dissemination
•  Necessitates confidence in user authenticity
    – Challenged by cyber-opportunists
•  Sybils – fake profiles created to mimic users
    – Some are general, some are identity clone accounts

Fake Accounts - problem

•  “Followers for hire”
    – Inflating follower numbers for other accounts
•  Fake followers
    – Bias an individual/organization’s popularity
    – Alter characteristics of the audience
    – Create a legitimacy problem
•  A multi-million dollar market for buying fake followers

Proposed approach

•  Most of the current Twitter-based research has focused on analysis of tweets
•  Twitter profile-based analysis
    – Faster detection
    – Simpler logistics
        •  Smaller storage and memory required
    – More comprehensive detection
        •  Can also detect accounts that don’t tweet

Implementation

•  Twitter user profiles
    – Created accounts to access profiles
    – Twitter API
    – Breadth-first search over a set of seed users
    – Currently collected 62 million user profiles
    – JSON (JavaScript Object Notation) format
•  MongoDB
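The breadth-first collection over seed users can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler used for the 62 million profiles: `get_profile` and `get_neighbor_ids` are hypothetical callables standing in for whatever Twitter API requests were made, and the returned dict stands in for the MongoDB store.

```python
from collections import deque

def bfs_collect(seed_ids, get_profile, get_neighbor_ids, max_profiles):
    """Breadth-first traversal from seed users, collecting up to max_profiles.

    get_profile(user_id) -> dict and get_neighbor_ids(user_id) -> iterable
    are hypothetical stand-ins for API calls.
    """
    queue = deque(seed_ids)
    seen = set(seed_ids)
    profiles = {}
    while queue and len(profiles) < max_profiles:
        user_id = queue.popleft()
        profiles[user_id] = get_profile(user_id)
        for neighbor in get_neighbor_ids(user_id):
            if neighbor not in seen:  # enqueue each user at most once
                seen.add(neighbor)
                queue.append(neighbor)
    return profiles

# Toy graph standing in for the follower/friend network (illustrative only).
toy_graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
collected = bfs_collect(
    seed_ids=[1],
    get_profile=lambda uid: {"id": uid},
    get_neighbor_ids=lambda uid: toy_graph[uid],
    max_profiles=10,
)
```

The `seen` set keeps a user reachable from several seeds from being fetched twice, which matters when every fetch counts against an API rate limit.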

Methodology

•  33 attributes for each profile
•  Manual inspection
    – Most of these attributes are set to default values
•  10 key attributes
    – name, followers_count, friends_count, verified, created_at, description, location, updated, profile_image_url and screen_name
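Reducing each stored profile to the ten key attributes might look like the sketch below. The attribute names come from the slide; the function name `reduce_profile`, the sample screen name, and the choice to fill missing attributes with `None` are assumptions made for illustration.

```python
KEY_ATTRIBUTES = (
    "name", "followers_count", "friends_count", "verified", "created_at",
    "description", "location", "updated", "profile_image_url", "screen_name",
)

def reduce_profile(profile):
    """Keep only the ten key attributes; missing ones become None (assumption)."""
    return {key: profile.get(key) for key in KEY_ATTRIBUTES}

# Made-up sample record; "lang" stands for any of the other 23 attributes.
raw = {"name": "FREE FOLLOW", "screen_name": "freefollow1", "lang": "en"}
reduced = reduce_profile(raw)
```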

Methodology

•  Raw user profiles – 62 million
•  Group by Name: 21.2 million accounts, 3.8 million groups
    – e.g. Group name: FREE FOLLOW
    – Number of groups: 1
    – Number of accounts: 1833
•  Group by Name + Description + Location: 0.7 million groups
    – Number of groups: 78
    – Number of accounts: 804
•  Group by Screen-name Patterns
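The two grouping stages (by name, then by name + description + location) can be sketched with a single helper. This is an illustration only: the helper name `group_by`, the sample records, and the decision to drop single-account groups are assumptions, not details stated on the slides.

```python
from collections import defaultdict

def group_by(profiles, keys):
    """Group profiles that share identical values for all the given keys.

    Groups with a single account are dropped here (an assumption), since
    one account alone shows no shared-attribute evidence.
    """
    groups = defaultdict(list)
    for profile in profiles:
        groups[tuple(profile.get(k) for k in keys)].append(profile)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Made-up sample records for illustration.
profiles = [
    {"name": "FREE FOLLOW", "description": "a", "location": "x", "screen_name": "ff1"},
    {"name": "FREE FOLLOW", "description": "a", "location": "x", "screen_name": "ff2"},
    {"name": "FREE FOLLOW", "description": "b", "location": "y", "screen_name": "ff3"},
    {"name": "Unique Person", "description": "c", "location": "z", "screen_name": "up"},
]
by_name = group_by(profiles, ["name"])
by_all = group_by(profiles, ["name", "description", "location"])
```

Adding description and location to the key splits a large same-name group into smaller, tighter groups, which is the narrowing the slide reports.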

Pattern Recognition

•  For each group:
    – Take Name1 as the base Screen_Name and calculate SE1
    – Select a Screen-Name (Name2) for analysis
    – Concatenate screen-names (Name1Name2) and calculate SE2
    – Is SE2 - SE1 < 0.1?
        •  Yes: Add Name2 to current collection
        •  No: Add to group for analysis in next iteration
•  SE – Shannon Entropy
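One pass of the flowchart above can be sketched as follows. The 0.1 threshold and the SE2 - SE1 comparison come from the slides; computing entropy per character with a base-2 logarithm, taking the first screen name as the base, and the function names are assumptions, as are the sample screen names.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy of text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def collect_similar(screen_names, threshold=0.1):
    """Split screen names into a collection around the first name and a rest.

    A name joins the collection when appending it to the base name raises
    the entropy by less than threshold (SE2 - SE1 < threshold).
    """
    base, *others = screen_names
    se1 = shannon_entropy(base)
    collection, next_iteration = [base], []
    for name in others:
        se2 = shannon_entropy(base + name)
        if se2 - se1 < threshold:
            collection.append(name)
        else:
            next_iteration.append(name)
    return collection, next_iteration

# Made-up screen names for illustration.
collection, rest = collect_similar(["freefollow1", "freefollow2", "xyzqkjw"])
```

A name built from the same characters as the base barely changes the character distribution of the concatenation, so the entropy difference stays small; an unrelated name adds new characters and raises it. The names left in `rest` would be processed again in the next iteration with a new base.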

Methodology

•  Group by Name: 3.8 million groups
•  Group by Name + Description + Location: 0.7 million groups
•  Group by Screen-name Patterns: 8276 patterns, 263,250 accounts
    – e.g. Patterns in a collection list: freefollow, SavedCompete, etc.

Elimination of false positives

•  Identifies closely associated accounts
•  False positives – Highly popular names
•  Final filter
    – Determine distribution of update times
        •  Genuine accounts – uniform distribution
        •  Fake - non-uniform
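The final filter could be approximated by testing a group's update hours against a uniform spread over the day. The slides do not say which test was used; the chi-square statistic, the 5% level, and all names below are assumptions chosen for illustration.

```python
from collections import Counter

CHI2_CRITICAL_23_DOF = 35.172  # 5% critical value for 23 degrees of freedom

def hour_chi_square(update_hours):
    """Chi-square statistic of hour-of-day counts against a uniform spread."""
    if not update_hours:
        raise ValueError("need at least one update hour")
    counts = Counter(update_hours)
    expected = len(update_hours) / 24
    return sum((counts[h] - expected) ** 2 / expected for h in range(24))

def looks_non_uniform(update_hours):
    """True when the hours deviate from uniform at the 5% level."""
    return hour_chi_square(update_hours) > CHI2_CRITICAL_23_DOF

# Made-up samples: ten updates in every hour vs. updates packed into two hours.
spread_out = [h for h in range(24) for _ in range(10)]
bursty = [3] * 200 + [4] * 40
```

Under this sketch a group sharing a popular real name would tend to pass as uniform, while a group updated in one batch would be flagged.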

Update time occurrences

Figure: update time occurrences for Chopra and free_follow. Distribution of update times as a function of time of day.

Methodology

•  Group by Name: 3.8 million groups
•  Group by Name + Description + Location: 0.7 million groups
•  Group by Screen-name Patterns: 8276 patterns, 263,250 accounts
    – e.g. Patterns in a collection list: freefollow, SavedCompete, etc.
•  Group by Update times: 8873 groups, ~56,000 accounts
•  Core group of fake accounts: same Name, Description, Location, and Update times, and Screen names matching 1 or more patterns

Analysis of Results

•  Generation of Ground Truth data
    – Random sample of Twitter user profiles
    – Similar size and timeline for consistency
•  Fake Profile characteristics
    – Update Times
    – Creation Times
    – URL Analysis

Update times

Figure: panels for ground truth and fake profiles.

Update days

Figure: panels for ground truth and fake profiles.

Creation times

Figure: panels for ground truth and fake profiles.
No bias in the creation times of ground truth data, unlike the fake profiles

Creation days

Figure: panels for ground truth and fake profiles.

Creation time distribution

•  Fake accounts are mostly created in batches, over short time intervals
•  Note: these results are for ~60000 accounts
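Batch creation over short time intervals can be made visible by counting accounts per fixed time window. The sketch below is an illustration only: the one-hour window, the minimum of three accounts, the function names, and the timestamps are all assumptions.

```python
from collections import Counter
from datetime import datetime, timezone

def creation_bursts(created_at, window_seconds=3600, min_accounts=3):
    """Count accounts per fixed time window and return the crowded windows."""
    buckets = Counter(int(ts.timestamp()) // window_seconds for ts in created_at)
    return {
        datetime.fromtimestamp(bucket * window_seconds, tz=timezone.utc): count
        for bucket, count in sorted(buckets.items())
        if count >= min_accounts
    }

def utc(*args):
    """Build a timezone-aware UTC datetime (helper for the sample data)."""
    return datetime(*args, tzinfo=timezone.utc)

# Made-up creation times: four accounts within one hour, two isolated ones.
created = [
    utc(2013, 5, 1, 10, 2), utc(2013, 5, 1, 10, 15), utc(2013, 5, 1, 10, 40),
    utc(2013, 5, 1, 10, 59), utc(2012, 1, 7, 18, 30), utc(2014, 9, 20, 6, 5),
]
bursts = creation_bursts(created)
```

Fixed windows are the simplest choice; a batch that straddles a window boundary is split across two buckets, so a sliding window may be preferable for real data.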


Figure: collage of images from distinct URLs for a group of 659 accounts within a fake profile set
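One simple way to quantify how few distinct images a group uses is the ratio of distinct `profile_image_url` values to accounts. This metric, the function name, and the example URLs are assumptions for illustration; the slides show only the collage.

```python
def image_url_diversity(profiles):
    """Ratio of distinct profile_image_url values to accounts that have one."""
    urls = [p["profile_image_url"] for p in profiles if p.get("profile_image_url")]
    return len(set(urls)) / len(urls) if urls else 0.0

# Made-up group: four accounts sharing two image URLs.
group = [
    {"screen_name": "ff1", "profile_image_url": "https://example.com/a.png"},
    {"screen_name": "ff2", "profile_image_url": "https://example.com/a.png"},
    {"screen_name": "ff3", "profile_image_url": "https://example.com/b.png"},
    {"screen_name": "ff4", "profile_image_url": "https://example.com/a.png"},
]
diversity = image_url_diversity(group)
```

A value near 1 means almost every account has its own image; a value near 0 means a few images are reused across the group.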

Conclusions

•  A large Twitter profile database was analyzed
    – Fake accounts were detected based on their profile characteristics
•  Fake accounts tend to have minimal diversity in:
    – creation times
    – update times
    – image URLs
•  The creation of fake accounts is possibly only semi-automated

Future Work

•  Using this core fake data as a seed will help us identify a more comprehensive fake profile set
•  Pattern-matching focused algorithm
•  Understand the intra- and inter-group relationships of these fake accounts
•  Tweet analysis - Apache Spark - GraphX
