Filtering out improper user accounts from twitter
user accounts for discovering individuals interested
in certain topic
Chao CAI, Shun SHIRAMATSU
Dept. of Computer Science, Graduate School of Engineering, Nagoya Institute of Technology
Background
• Demand is growing steadily for scouting participants for online opinion
collection (Web-based debate systems, online surveys, etc.)
• Twitter is an SNS with over 45 million monthly active users in Japan, who are
potential participants
• Improper user accounts appear among the accounts collected with a given
keyword list
• (e.g. official accounts, bots, etc.)
Collagree
• Web-based debate system
• Also used by the local government of Nagoya for opinion collection
• We aim to develop a participant invitation agent
Procedure of invitation agent
1. Keyword list extraction (or preparation in advance)
2. Gathering and filtering the initial user account set
3. More specific classification of the user group
4. Participant invitation
Definition of improper user accounts
Official user: specific terms in the on-screen name or description
• (e.g. "kousiki akkaunto" (official account) or a company name)
Inactive user: retweets only campaign content, usually has no description, and
has an on-screen name made of a random combination of characters
Robot user: specific terms in the on-screen name or description; the description
and tweet content contain ads or promotions
• (e.g. bot)
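The three definitions above could be turned into rough labeling hints along the following lines. This is only a sketch: the term lists, the "random-looking name" rule, and the function name `improper_hints` are assumptions made for illustration, and the slides themselves use a learned classifier rather than fixed rules.

```python
import re

# Hypothetical term lists. The slides only name "official account",
# company names and "bot" as examples of specific terms.
OFFICIAL_TERMS = ("公式", "official")
BOT_TERMS = ("bot",)


def improper_hints(screen_name, description):
    """Return the set of improper-account hints that fire for one profile."""
    text = f"{screen_name} {description}".lower()
    hints = set()
    if any(term in text for term in OFFICIAL_TERMS):
        hints.add("official")
    # Plain substring match: words such as "robot" would also fire.
    if any(term in text for term in BOT_TERMS):
        hints.add("robot")
    # Assumption: a "random" name is 8+ ASCII letters, digits or underscores
    # with at least one digit, combined with an empty description.
    looks_random = re.fullmatch(r"(?=.*\d)[a-z0-9_]{8,}", screen_name.lower())
    if not description.strip() and looks_random:
        hints.add("inactive")
    return hints


print(improper_hints("ABC株式会社【公式】", "新商品のお知らせ"))
print(improper_hints("x7k2q9zt41", ""))
```

Hints like these are better suited to speeding up manual labeling than to replacing it, since the retweet-only behavior of inactive users cannot be seen from the profile alone.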
Approach
• Collecting data with the Twitter Search API and Streaming API based on keywords or
hashtags
• Both are used to keep the data balanced (ratio of improper to individual accounts)
• MeCab for tokenization, TF-IDF for vectorization before constructing feature vectors
• Two ways to generate the feature vector: the Mixed process and the Separated process
• Mixed: tweet content and user information (name and description) are processed as one
document
• Separated: the two parts are processed as two documents in two different corpora
• RBF-kernel SVM as the learning model
• SVMs perform well in binary classification tasks
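A minimal sketch of the Mixed and Separated processes, under stated assumptions: the tokens are written out by hand instead of coming from MeCab, the smoothed idf formula is one common variant (the slides do not say which TF-IDF weighting was used), and `tfidf_vectors` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of token lists using tf * idf."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf, so a term present in every document keeps a non-zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


# Each sample: (user information tokens, tweet tokens). Made-up examples.
samples = [
    (["育児", "ママ"], ["保育", "園", "落ちた"]),
    (["公式", "bot"], ["育児", "グッズ", "セール"]),
]

# Mixed: user information and tweet form one document in one corpus.
_, mixed = tfidf_vectors([info + tweet for info, tweet in samples])

# Separated: one corpus per part, then the two vectors are concatenated.
_, info_vecs = tfidf_vectors([info for info, _ in samples])
_, tweet_vecs = tfidf_vectors([tweet for _, tweet in samples])
separated = [i + t for i, t in zip(info_vecs, tweet_vecs)]

print(len(mixed[0]), len(separated[0]))
```

In this toy data the term 育児 occurs in one profile and in one tweet. The Mixed vector gives it a single dimension, while the Separated vector keeps one dimension for the profile and one for the tweet, so the Separated vector is one dimension longer.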
Related work
A Machine Learning Approach to Twitter User Classification (Marco Pennacchiotti
2011)
• Proposed a general model for user profiling and ran a deep analysis of the
linguistic content of tweets
• Designed the feature vectors from (1) user information, (2) tweet content, (3) tweet behavior and
(4) user relationships
• We dealt with (1) and (2) in this research.
• Did not consider the description to be good-quality information
• 48% of English users had no bio in their description
• Over 50% of avatars were irrelevant to their classification task
• Targeted only English-language Twitter users
• Usage habits differ between English and Japanese users
[Figure: feature vector construction. Each tweet record consists of user information (on-screen name and description) and tweet content. Mixed: both parts are tokenized and vectorized together into one feature vector, each dimension holding the tf-idf of one term. Separated: the two parts are vectorized separately into an information vector and a text vector, which are then combined into one feature vector.]
Training data
• We assumed a particular topic: “child care”
• First collection: Streaming API with a keyword list (子育て, 育児, 待機児童, 育休,
ホームスタート, マタニティ, 出産, 子どもの貧困, シングルマザー, 産後, 保育; in English:
child rearing, childcare, children on nursery waiting lists, childcare leave,
Home-Start, maternity, childbirth, child poverty, single mother, postpartum, nursery care)
• 269 tweets collected: 210 improper accounts (78%), 59 individual accounts (22%)
• Second collection: Twitter Search API with a hashtag list (#あたしおかあさんだけど,
あたしおかあさんだから, #ぼくおとうさんだから, #おまえおとうさんなのに,
#おまえおとうさんだろ; roughly "I'm a mom, but", "Because I'm a mom",
"Because I'm a dad", "Even though you're a dad", "You're a dad, aren't you")
obtained from Twitter trends
• 400 tweets collected: 37 improper accounts (9%), 363 individual accounts (91%)
• We were fortunate to find hashtags suitable for collecting tweets by individual users
• The data consisted of 669 tweet texts with user information
• 422 accounts are individuals and 247 are improper accounts.
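The class balance of the two collection rounds can be checked with a few lines of arithmetic. The counts are the ones reported on the slide; the helper name `shares` is made up for this sketch.

```python
# (improper, individual) counts per collection round, as reported on the slide.
rounds = {
    "Streaming API (keywords)": (210, 59),
    "Search API (hashtags)": (37, 363),
}


def shares(improper, individual):
    """Return the rounded percentage of improper and individual accounts."""
    total = improper + individual
    return round(100 * improper / total), round(100 * individual / total)


for name, (improper, individual) in rounds.items():
    print(name, shares(improper, individual))

total_improper = sum(i for i, _ in rounds.values())
total_individual = sum(n for _, n in rounds.values())
print(total_improper, total_individual, total_improper + total_individual)
```

Summing the per-round counts gives 247 improper and 422 individual accounts, 669 in total, so combining the two rounds brings the data set much closer to balance than either round alone.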
Main Idea: Binary classification based on the contents of user information and tweets
[Figure: example accounts from each group, an improper user and an individual user]
SVM settings
The experiment was run with 5 different hyperparameter settings of an RBF-kernel SVM
C: the cost parameter
• Together with gamma, C trades off misclassification of training samples against the
complexity of the decision surface.

        Default                          Setting1   Setting2   Setting3   Setting4
C       1                                2×10^-5    2×10^15    2×10^-5    2×10^15
Gamma   1/n (n: number of dimensions)    2×10^-15   2×10^-15   2×10^3     2×10^3
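To see what the gamma values in the table mean for the RBF kernel, the sketch below evaluates K(x, y) = exp(-gamma * ||x - y||^2) on two toy vectors. It assumes the flattened table entries read as 2×10^k; the vectors and the name `rbf_kernel` are made up for illustration, and no SVM is trained here.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# (C, gamma) pairs, assuming the table entries are 2 x 10^k.
settings = {
    "Setting1": (2e-5, 2e-15),
    "Setting2": (2e15, 2e-15),
    "Setting3": (2e-5, 2e3),
    "Setting4": (2e15, 2e3),
}

x, y = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]  # two toy feature vectors
n = len(x)  # the default gamma is 1/n, with n the number of dimensions
for name, (_, gamma) in {"Default": (1.0, 1.0 / n), **settings}.items():
    print(name, rbf_kernel(x, y, gamma))
```

With the small gamma the kernel value stays very close to 1 for any pair of vectors, so the kernel varies only slightly across the data; with the large gamma it collapses to 0 for any two distinct vectors, so each sample is similar only to itself.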
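Results of experiments
[Figure: precision, recall and F-measure of the Mixed and Separated processes under each setting. Annotations for Setting2: Separated 4 pt higher on recall, Mixed 2 pt higher on precision, Separated 1 pt higher on F-measure.]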
Evaluation
o All settings performed well on recall:
o Imbalance of the data
o Setting2 gave the best-balanced performance on both precision and recall:
o The large C and small gamma provide more support vectors to deal with the
similarity of the data
o Manual labeling influenced the result
o Mislabeled samples
o Both the Mixed and the Separated process performed well
o The Separated process provides more features of the data
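For reference, precision, recall and F-measure can be computed from confusion counts as in the sketch below. The counts are invented for illustration and are not results from the experiment; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for a reasonably balanced classifier.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f)

# Made-up counts for a classifier that labels every sample as positive:
# recall is perfect, but precision only reflects the class ratio.
p_all, r_all, f_all = precision_recall_f1(tp=50, fp=50, fn=0)
print(p_all, r_all, f_all)
```

The second case shows why a high recall on its own says little when the classes are imbalanced, and why recall is best read together with precision or the F-measure.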
Conclusions
The contents of user information and tweets can be essential factors in the
filtering task
Still not enough when dealing with much more data
Some keywords or hashtags appearing in Twitter trends may help collect
individual accounts
Improper accounts need time to respond to a trend
The model is expected to be unreliable when dealing with enormous amounts of data
Simplicity of the feature vector for each user, which considers only one tweet per user
Future work
Propose a method to find hashtags or keywords that yield mostly individual
accounts
Help collect training data
Infer some features of improper accounts
Include tweet behavior and user relationships [Marco Pennacchiotti 2011] in the feature
vector design
Consider deep learning if much more training data becomes available
Link with an existing platform (e.g. Collagree)
Test the system in practice


Editor's Notes

  • #2: Thank you for coming. First, let me introduce myself: my name is xxxxx, from the department of xxxxxx. Today I want to talk about our research, titled xxxxxx.
  • #3: Let's begin with the background of our research. Since we live in an IT society, there is a definite and growing need for xxxxx. As a place to find potential participants, we considered social network services such as Twitter. Twitter is an SNS with xxxx, whom we want to invite to those events. But while collecting user data based on a keyword list, many improper users appeared, such as xxxx, whom we want to filter out.
  • #4: As mentioned, there are many Web-based debate systems; in this research we concentrated on Collagree. Collagree aims at consensus building and is also used by the Nagoya government to collect the opinions of local residents. To make better use of this platform, we think it would be great to invite more people from different locations and with diverse backgrounds to offer new ideas. So we plan to develop a participant invitation agent for this system.
  • #5: Here is the procedure of the whole agent. First, the agent receives a keyword list, which can be prepared by a human or extracted from the introduction of the debate topic. Then the agent collects a user set based on the list and filters out the improper users. Before the agent actually sends invitations, the user group is classified more specifically to find out which users can really attend the debate. Then the agent sends the invitations to those users. This research focuses on the second part, filtering out improper users. So which kinds of accounts are improper?
  • #6: Here is how we defined improper accounts. There are three kinds. Official user accounts are used by companies or public facilities; they usually have specific terms in their on-screen name or description. The second kind is the inactive user, who only retweets campaign content for a gift; these users usually have no description and a random combination of characters as their on-screen name. The last kind is the robot user, who is also likely to have specific terms such as "bot" in the description or on-screen name, and whose tweets often contain ads or promotional information. Our goal is to filter out these kinds of accounts.
  • #7: Here is the approach of this research. To begin with, we collect data with the Twitter Search API and Streaming API based on keywords or hashtags, to keep the positive and negative samples balanced. Since the tweets are mainly written in Japanese, we need MeCab for tokenization, and we use TF-IDF for vectorization. We propose two ways to generate the feature vector, the Mixed process and the Separated process, which I would like to demonstrate later. For the Mixed process, we xxxxx. For the Separated process, we xxxx. We chose the RBF-kernel SVM as the learning model since SVMs perform well in binary classification.
  • #8: There is some related work. One study was done by Marco Pennacchiotti in 2011, xxxxxx. They proposed a general model for user profiling and ran a deep analysis of the linguistic content of tweets. They designed their feature vector with four parts, xxxxxx, and we dealt with xxxxx in our research. However, they did not consider xxxxx, since there are xxxxxxx and over 50% xxxx. Their research also focused on English Twitter users, but there are definitely many differences between English and Japanese users, such as usage habits and language. As I mentioned, we used user information and tweet content for the feature vector design. Let me walk you through the design.
  • #9: First we get the initial data, which consists of -----------. For the Mixed process, we process these two parts as one document to generate one vector, and each dimension holds the TF-IDF value of one term. This is the feature vector of the Mixed process. For the Separated process, we process the two parts separately to generate two TF-IDF vectors; again, each dimension holds the TF-IDF value of one term. Then we combine the two vectors into one. This is the feature vector of the Separated process. Next, we tried this approach in practice.
  • #10: Here is the training data for the experiment. We first assumed a debate topic in Collagree: child care. Then we collected data twice. The first time, we used the Streaming API with the keyword list shown here. Among the 269 users, 78% are improper. The second time, we used the Twitter Search API with hashtags that we happened to find in Twitter trends. The hashtags are about a song related to child care. In this search, 91% of all 400 tweets were posted by individual users. So the whole data set consists of 669 tweet texts with user information.
  • #11: Here are samples from each group in the data.
  • #12: To do that, we ran an SVM with 5 different hyperparameter settings, xxxxx, with different combinations of C and gamma. C, by the way, is the cost parameter, which together with gamma trades off misclassification against the complexity of the decision surface.
  • #13: Let's take a look at the results of the experiments. You can see that Setting2 gave the best performance on F-measure, and with Setting2 the Separated process is one point higher than the Mixed process. On recall, Separated is 4 points higher, but on precision, Mixed is 2 points higher. Although all settings performed well on recall, only Setting2 performed well on precision.
  • #14: We consider that the imbalance and lack of data are the reason why all settings performed well on recall. As for Setting2 giving the best-balanced performance, we think that is because the large C and small gamma provide more support vectors to deal with the similarity of the data. Since we labeled the data ourselves, the result could be affected by human labeling mistakes. And although the Mixed and Separated processes both performed well, we think the Separated process can provide more features of the data.
  • #15: Here are the conclusions. We consider that the contents of user information and tweets can be important factors in the filtering task, but they are still not enough when there is much more data. Some keyword lists or hashtags in Twitter trends can probably help collect individual accounts; we think improper accounts may need time to react to those trends. However, the model is expected to be unreliable when dealing with a large amount of data, since the feature vectors are similar to each other and we consider only one tweet for each user.
  • #16: For our future work, we would like to find a way to detect hashtags or keywords that help us find more individual accounts, which would help us collect training data and may reveal some features of improper users. We plan to include tweet behavior and user relationships in the feature vector design, and we are considering introducing deep learning into our research if we can get enough data. Finally, we want to connect our filter system to an existing platform, in this case Collagree, to evaluate our system in practice.