Filtering out improper user accounts from Twitter user accounts for discovering individuals interested in a certain topic

Chao CAI, Shun SHIRAMATSU
Dept. of Computer Science, Graduate School of Engineering, Nagoya Institute of Technology

Background
• Continuously growing demand for participant scouting for online opinion collection (Web-based debate systems, online surveys, etc.)
• Twitter is an SNS with over 45 million monthly active users in Japan, who are potential participants
• Improper user accounts (e.g. official accounts, bots) appear among the user accounts collected by certain keywords

Collagree
• Web-based debate system
• Also used by the local government of Nagoya for opinion collection
• We aim to develop a participant invitation agent

Procedure of the invitation agent
• Step 1: Keyword list extraction (or preparation in advance)
• Step 2: Gathering and filtering the initial user account set
• Step 3: More specific classification of user groups
• Step 4: Participant invitation

Definition of improper user accounts
• Official user: specific terms in the onscreen name or description (e.g. "kousiki akkaunto" (公式アカウント, "official account") or a company name)
• Inactive user: retweets only campaign content, usually has no description, and has an onscreen name consisting of a random combination of characters
• Robot user: specific terms in the onscreen name or description (e.g. "bot"); the description and tweet content contain ads or promotion
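The cues in these definitions can be made concrete with a small rule-based check. This is only a minimal sketch, for example as an aid during manual labeling; it is not the classifier used in this work. The term lists and the function name `improper_cues` are assumptions for illustration.

```python
# Hypothetical cue lists; the slides only name "kousiki akkaunto" and "bot" as examples.
OFFICIAL_TERMS = ("公式", "official")
ROBOT_TERMS = ("bot",)


def improper_cues(name, description, tweets_are_all_retweets=False):
    """Return the set of improper-account cues that fire for one profile."""
    profile = f"{name} {description}".lower()
    cues = set()
    if any(term in profile for term in OFFICIAL_TERMS):
        cues.add("official")
    # Crude substring match: it would also fire on words such as "robot".
    if any(term in profile for term in ROBOT_TERMS):
        cues.add("robot")
    # Inactive cue: the account only retweets and has no description.
    if tweets_are_all_retweets and not description.strip():
        cues.add("inactive")
    return cues


print(improper_cues("ABC Store 公式アカウント", "News and campaigns"))
print(improper_cues("weather_bot", "Hourly forecast"))
print(improper_cues("xk3f9q", "", tweets_are_all_retweets=True))
print(improper_cues("Taro", "Father of two"))
```

Fixed term lists like these are brittle, which is one motivation for learning the distinction from TF-IDF features instead.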

Approach
• Collecting data with the Twitter search API and streaming API, based on keywords or hashtags
• Purpose: keeping the data balanced (ratio of improper to individual accounts)
• MeCab for tokenization and TF-IDF for vectorization before constructing feature vectors
• Two ways to generate the feature vector: Mixed process and Separated process
• Mixed: tweet contents and user information (name and description) are processed as one document
• Separated: the two parts are processed as two documents in two different corpora
• RBF-kernel SVM as the learning model
• Performs well in binary classification tasks
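The difference between the Mixed and Separated processes can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace splitting stands in for MeCab, the smoothed IDF formula is one common choice, and the toy profiles and tweets are invented.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed inverse document frequencies from tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(doc, idf, vocab):
    """Map one tokenized document to a TF-IDF vector over a fixed vocabulary."""
    counts = Counter(doc)
    return [counts[t] * idf[t] for t in vocab]


# Toy, already-tokenized data (whitespace split stands in for MeCab).
profiles = ["official store account".split(), "mother of two".split()]
tweets = ["new campaign today".split(), "my two kids today".split()]

# Mixed: profile and tweet form one document in a single corpus.
mixed_docs = [p + t for p, t in zip(profiles, tweets)]
mixed_idf = fit_idf(mixed_docs)
mixed_vocab = sorted(mixed_idf)
mixed = [tfidf_vector(d, mixed_idf, mixed_vocab) for d in mixed_docs]

# Separated: two corpora, two vocabularies, and the two vectors concatenated.
info_idf, text_idf = fit_idf(profiles), fit_idf(tweets)
info_vocab, text_vocab = sorted(info_idf), sorted(text_idf)
separated = [
    tfidf_vector(p, info_idf, info_vocab) + tfidf_vector(t, text_idf, text_vocab)
    for p, t in zip(profiles, tweets)
]

print(len(mixed[0]), len(separated[0]))
```

In the Separated vectors, a term that occurs in both the profile and the tweet (here "two") occupies two distinct dimensions, whereas in the Mixed vectors it is merged into one. Either set of vectors could then be passed to an RBF-kernel SVM.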

Related work
A Machine Learning Approach to Twitter User Classification (Marco Pennacchiotti, 2011)
• Proposed a general model for user profiling and ran a deep analysis of the linguistic content of tweets
• Designed the feature vectors from (1) user information, (2) tweet contents, (3) tweet behavior and (4) user relationships
• We dealt with (1) and (2) in this research.
• Did not consider the description to be good-quality information
• 48% of English users have no bio in their description
• Over 50% of avatars were irrelevant to their classification task
• Targeted only English-language Twitter users
• Usage habits differ between English and Japanese users

[Figure: construction of the feature vector. Each tweet's data is split into user information (onscreen name & description) and tweet contents, which are tokenized and vectorized; each vector element is the tf-idf of one term. Separated: the information vector and the text vector are combined into the feature vector. Mixed: a single feature vector.]

Training data
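• We assumed a particular topic: "child care"
• First, tweets were collected with the streaming API based on a keyword list: 子育て (child rearing), 育児 (childcare), 待機児童 (children on daycare waiting lists), 育休 (childcare leave), ホームスタート (Home-Start), マタニティ (maternity), 出産 (childbirth), 子どもの貧困 (child poverty), シングルマザー (single mother), 産後 (postpartum), 保育 (nursery care)
• 269 tweets collected: 210 improper accounts, 59 individual accounts
• Second, tweets were collected with the Twitter search API based on a hashtag list obtained from Twitter trends: #あたしおかあさんだけど ("I'm a mom, but"), #あたしおかあさんだから ("Because I'm a mom"), #ぼくおとうさんだから ("Because I'm a dad"), #おまえおとうさんなのに ("Even though you're a dad"), #おまえおとうさんだろ ("You're a dad, aren't you")
• 400 tweets collected: 37 improper accounts, 363 individual accounts
• Fortunately, we found hashtags suitable for collecting tweets by individual users
• The data consisted of 669 tweet texts with user information
• 422 accounts are individuals and 247 are improper accounts.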
[Pie charts: keyword-based collection: improper 78%, individual 22%; hashtag-based collection: improper 9%, individual 91%]
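The class ratios of the two collections and the combined totals follow from the per-collection counts with simple arithmetic. The sketch below only reproduces that arithmetic; the dictionary keys are descriptive labels chosen for this example.

```python
# Counts as reported on the slide: (improper, individual) per collection step.
collections = {
    "keyword list (streaming API)": (210, 59),
    "hashtag list (search API)": (37, 363),
}

total_improper = total_individual = 0
for name, (improper, individual) in collections.items():
    size = improper + individual
    total_improper += improper
    total_individual += individual
    print(f"{name}: {size} tweets, "
          f"{improper / size:.0%} improper, {individual / size:.0%} individual")

total = total_improper + total_individual
print(f"combined: {total} tweets, "
      f"{total_improper} improper, {total_individual} individual")
```

The two collections are skewed in opposite directions, so combining them gives a less extreme class ratio than either collection alone.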

Main Idea: Binary classification based on the contents of user information and tweets
[Figure: example of user account groups, showing an improper user and an individual user]

SVM settings
The experiment was run with 5 different hyperparameter settings of the RBF-kernel SVM
C: the cost parameter
• C trades off misclassification of training samples against the complexity of the decision surface; together with gamma, it determines how complex that surface can become.
         Default                          Setting1    Setting2    Setting3    Setting4
C        1                                2×10^-5     2×10^15     2×10^-5     2×10^15
Gamma    1/n (n: number of dimensions)    2×10^-15    2×10^-15    2×10^3      2×10^3
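To show what the gamma values mean in practice, the sketch below evaluates the standard RBF kernel, k(x, y) = exp(-gamma * ||x - y||^2), for each setting. It assumes the table entries are powers of ten (for example, 2×10^-5 for C in Setting1); the two toy vectors and the dimensionality `n_features` are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hyperparameter grid as read from the table; the default gamma is 1/n.
n_features = 4  # hypothetical dimensionality for this toy example
settings = {
    "Default": {"C": 1.0, "gamma": 1.0 / n_features},
    "Setting1": {"C": 2e-5, "gamma": 2e-15},
    "Setting2": {"C": 2e15, "gamma": 2e-15},
    "Setting3": {"C": 2e-5, "gamma": 2e3},
    "Setting4": {"C": 2e15, "gamma": 2e3},
}

# Two toy feature vectors standing in for TF-IDF vectors of two accounts.
x = [0.5, 0.0, 0.2, 0.0]
y = [0.0, 0.4, 0.2, 0.1]
for name, params in settings.items():
    print(name, rbf_kernel(x, y, params["gamma"]))
```

With the very small gamma the kernel value stays close to 1 for any pair of vectors, so all samples look similar to the model; with the very large gamma it drops to nearly 0 for any two distinct vectors.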

Results of experiments
[Figure: result charts. Annotations: Separated 4 pt higher in Setting2; Mixed 2 pt higher in Setting2; Separated 1 pt higher in Setting2]

Evaluation
o All settings performed well on the recall score, owing to the imbalance of the data
o Setting2 gave the best-balanced performance on both precision and recall: the large C and small gamma provide more support vectors to deal with the similarity of the data
o Manual labeling influenced the results (mislabeling)
o The Mixed and Separated processes both performed well
o The Separated process provides more features of the data
Conclusions
• The contents of user information and tweets can be an essential factor in the filtering task
• Still not sufficient when dealing with much more data
• Some keywords or hashtags appearing in Twitter trends may help collect individual accounts
• Improper accounts require time to respond to a trend
• The model is expected to lack reliability when dealing with an enormous amount of data
• The feature vector for each user is simple, considering only one tweet per user

Future work
• Propose a method to find hashtags or keywords that yield mostly individual accounts
• Helps with collecting training data
• Infer some features of improper accounts
• Include tweet behavior and user relationships [Marco Pennacchiotti 2011] in the feature vector design
• Consider deep learning if much more training data becomes available
• Link with an existing platform (e.g. Collagree)
• Experiment with the system in practice
