SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 728
Twitter Spam Detection using Cobweb
Vidhi Tiwari1, Akarsh Srivastava2, Mrs. P. Akilandeswari3
1SRMIST, Department of Computer Science and Engineering, Kattankulathur Campus
2SRMIST, Department of Computer Science and Engineering, Kattankulathur Campus
3Assistant Professor, Department of Computer Science and Engineering, Kattankulathur Campus, SRM Institute of
Science and Technology (formerly known as SRM University).
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - With the advent of online social networks, or OSN, there has been a significant rise of users daily, with twitter being
one of the largest social media networking sites in the world, for the world to have ever used. The network has been an ever-
growing web of users who may or may not have malicious intent towards society. With the increase in the number of people
using the OSN, there are more chances of exploitation. One of the major sources of exploitation is spam on the twitter feed. It is
known by survey, that 1 in every 21 tweets is spam. There are multiple ways a tweet could be used in harmful ways as well as
multiple ways a tweet can be made into a spam and hence, it is necessary to differentiate a tweet that is normal or spam. This
project aims to classify spam using a data stream clustering technique called cobweb and a result enhancer namely gradient
booster. Cobweb is an enhanced clustering algorithm that classifies data using trees. We aim to combine this to gradient
boosting, which is also a tree-based classifier to create a more enhanced spam detection system than there already are.
Key Words: Twitter; spam; cobweb; gradient boosting; URL; hashtag; phishing; mention; logarithmic function
1. INTRODUCTION
Ever since the dawn of the time of online social networks, OSNs (for example Twitter, Facebook, Instagram), there have
been new and upcoming users everyday. This growing network provides opportunities to an array of people using the
network since multiple accounts from a user are possible. While many users have been using this commendable
technology for greater use such as reaching out to help to employment opportunities, organizing fundraisers or reaching
out to help causes, it is often used for malicious motives which, unfortunately has a lot of channels. Twitter has been one of
the most leading social media networks since OSNs with a growth rate of the number of users being exponential. A survey
conducted in 2018 showed that there are a total of 231 million users on Twitter. Out of these about 200 million log onto
their accounts everyday. This gives a lot of ammunition to spammers. A statistical study showed that 1in every 21 tweets is
spam. With a number this high, we could be duped potentially every time we are logged in. Spammers use multiple
techniques to create spam tweets. The most popular method is the shortened URLs. Since the URLs’ destination isn’t
known, spammers use it to deceive users. Clicking on the link would lead you to get malware. Another method used is the
use of hashtags. Any word with the symbol of ‘’ in front of it can be used as a traffic generator. Tapping on it could lead to
potential spam. ‘Mentions’ work the same way as hashtags. They both generate traffic on and off the platform.
2. CONCEPTS IMPLEMENTED
2.1 Cobweb Clustering
Cobweb is a data stream clustering algorithm that was developed by Douglas H.Fisher. It is a system that works on the idea
of conceptual clustering of the input data. The algorithm observes the data and creates a classification tree. Each node
symbolizes a class and all the daughter nodes are expressed in the form of a probabilistic concept (PC) which is a probable
summarization of all attributes of the nodes which is used as a label for the parent node (class). There are a few operations
that cobweb executes to create the classification trees:
1) Merging of The Nodes: Merging the nodes requires replacement by a singular node whose children are the additive of
the children of the primary node set. The child node should summarize the attribute-value distributions of all objects
classified under them in the tree.
2) Splitting a given node: The node in reference is divided into two and replaced with its children nodes.
3) Inserting a new node: A node is created which represents the object that will be inserted into the tree.
4) Passing an object down the hierarchy of the tree: This step calls the Cobweb algorithm on the object and the subtree
rooted in the node.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 729
The basic algorithm for cobweb works by taking a node as an input for a record. If the base has no children, the node at
hand is set as the child node and the category of the child’s features is added to the tree. However, if the case has children,
the Category Utility is calculated for the given node and the best CU, that is, the smallest value leading class inherits the
given node as a child. This process is repeated for all entries of the dataset. The main reason why a classification tree is
used is to predict any missing attribute or recognize a new class of a new object and add it to the tree.
2.2 Gradient Boosting
Gradient boosting is one of the machine learning techniques for any set of classification and regression problems. It gives,
as result, a prediction model in the form of an assemblage of weaker prediction models, which is a method that uses
multiple weak algorithms to train into a better working model, typically decision trees. It works on building the model in a
stage-like fashion much like most of the boosting algorithms do, and it creates a more accurate system by allowing the
optimization of an arbitrary differentiable LS to give a generalized result. In many supervised algorithms used as learning
algorithms, the programmer has an output variable p and a vector of input variables namely q described using a joint
probability distribution P(p,q). Using a training set (p,q). . . (pn,qn) of already known values of p and corresponding values
of q, the goal is to find an approximate value to a function that gives a minimized value of the expected value of some
specified LF. The gradient boosting algorithm assumes a real-valued q and looks for an approximation for the value of p,
from the given set of values, in the form of a weighted sum of functions from some class, called base learners, also known
as weak learners.
3. STATE OF ART
Spam detection has had tremendous breakthroughs since the onset of everyday use of social media platforms, especially
on twitter.
In 2019, a novel stream clustering technique was designed which used incremental naive bayes classifier that was trained
to work efficiently on micro-clusters. The INBs capture the boundary and mean values of the microclusters, whereas the
Euclidean distance just utilizes the mean value of the clusters. This mostly gives skewed results for asymmetric big micro-
clusters. In this paper, DenStream was promoted to, by the proposed framework, to be called INB-DenStream. To show the
efficacy of INB-DenStream, common and known methods much like DenStream, CluStream and StreamKM++ were applied
to Twitter datasets and their performances were noted for values such as purity, overall precision, overall recall, F1
measure, parametric sensitivity, and computational complexity. The results of the comparison showed the superiority of
their method to the rivals in almost all the datasets.
In 2017, Mahdi Washha,, Aziz Qaroush, Manel Mezghani and Florence Sedes formulated the Hidden Markov Model (HMM)
to be a time dependent model for real-time filtering of spam tweets solely based on the topic of the tweets. This method is
only functional on the easily available and openly accessible meta-data in the tweet object to detect spam tweets exiting
from a stream of tweets related to a topic (e.g., #Trump), while considering the state of previously handled tweets
associated with the same topic. Compared to the robust and well known classical time-independent classification method,
the Random Forest algorithm, the experimental evaluation shows that the efficiency increases on the increase the quality
of the topics in terms of overall precision, recalls, and F measure performance metrics.
In 2018, Rutuja Katpatal, Aparna Junnarkar studied the various classifiers and performed and presented a comparative
study of them all for research purposes.
Anti Trust Rank algorithm is a known link-based spam detection algorithm which works on the principle that spam pages
are more likely to be referenced by other spam pages. It may function on any possible platform. Since a real world
connected web graph involves multiple billions of nodes, it is crucial to develop work efficient spam detection algorithms
for making the process easier. Thus, in 2018, Yeon Seong Jeong, Joyce Jiyoung Whang, Inderjit S. Dhillon, Seung Goo Kang
and Jungmin Lee developed the asynchronous Anti Trust Rank algorithm which allows us to reduce the number of
arithmetic operations required to detect spam as compared to the traditional synchronous Anti-TrustRank algorithm by a
large margin, without lowering the performance in detecting web spam.
Spam detecting systems mainly build their classification architecture which includes the binary classification of their data
and then it can be solved by a chosen machine learning algorithm that suits best to give the best possible results. The ML
algorithms, for example, Naïve Bayes classifier or support vector machine classifier then report the behavior of these
designed models. Hence, in 2016, Miss Shukla, Twinkle Kailas and Prof.D.B. Kshirsagar showed the effect of the data
related factors, namely the spam to non-spam ratio, the training data size, and data sampling, to the detection performance
of the system. The feature of the shown system was time varying spam tweet detection. The System showed that spam
detection was a huge challenge and it bridges the space between the performance evaluation. It mainly focused on the data
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 730
that was used for training and testing the system, the features and the model to identify the genuine, legitimate user and
report the spam user by providing the answer in a binary value, that is, 1 or 0.
4. INFERENCE FROM SURVEY
Inference from the survey was done for over multiple journal papers as well as conference papers for a clear
understanding of the subject. The basic idea of spam on twitter was gathered by papers written by Aparna Junnarkar,
Rutuja Katpatal, Aparna Junnarkar [3]; H. Tsukayama [11]; D. Song, K. Thomas,V. Paxson and C. Grier [13]; G. Stringhini,G.
Vigna and C. Kruegel [14]; Chao Chen [23]. The tested methods were understood and the results were analyzed. This gave
an average measure of values as results that we use to compare our results with. This provides a standard to our work,
given this is a system that is being built to provide an efficient alternate method to detect spam on twitter.
5. DATA SET ANALYSIS AND STUDY
With the use of machine learning algorithms: cobweb and gradient boosting, there is a need to train the algorithm before
we test the system for detecting spam. Hence, there are two sets of data sets that are used.
5.1 Training Data set
This system uses a training data set which is used to help increase the accuracy of the algorithm used. The accuracy of the
performance of the algorithm depends solely on the amount of training data used and the complexity of the data set, that
is, a good mix of data in the data set. Programmers divide the data set that is acquired into two uneven parts. The greater,
three-fourth is used for training while the smaller division is used for the testing. That is done when not a lot of data can be
gathered. This project allowed us to gather enough tweets to have multiple data sets.
Table -1: Training data set attributes
5.2 Testing Data Set
The testing data set is similar to the training data set but smaller in data entries as it is only used for testing the accuracy of
the system. In this case, it is missing an attribute for spam or not spam as that is the test, but at times the testing and
training data set is the same.
Table -2: Testing data set attributes
Attribute Data Type
1 Tweet ID Integer
2 Tweet Text
3 Followers Integer
Attribute Data Type
1 Tweet ID Integer
2 Tweet Text
3 Followers Integer
4 Following Integer
5 Is_retweet Binary
6 Location Text
7 Spam_or_notspam Text (Spam/Not Spam)
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 731
4 Following Integer
5 Is_retweet Binary
6 Location Text
6. Methodology Used
With this project, we propose to attempt to improve the method with which spam on the social media platform, Twitter, is
detected.
We aim to implement cobweb, which is an enhanced conceptual clustering algorithm, that classifies data streams using
trees in combination with gradient boosting, which would be used to enhance the produced clustered structure fed into
the boosting technique to enhance the output accuracy.
This hybrid model works like a typical flow chart model where the cobweb technique would first create the tree by
creating classes and adding nodes to a class according to it’s Category Utility(CU). This is repeated for the length of the
data set and this tree is then fed into the gradient booster for enhancement of the classification. Gradient boosting creates
trees in a stage-wise manner using the weaker associations and trains the algorithm repeatedly to improve the
classification model.
7. Architecture Diagram
Fig -1: Architecture diagram representing the flow of the algorithm
8. MODULES
Module 1: Feature Extraction
Feature extraction is the method of reducing dimensions of the data at hand. This module will single out every part of the
tweet and label them separately. For example, all the characters after @ would be stored under a single username and
labeled. Some other features of the tweet include hashtags, links in the tweet, mentions, etc.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 732
Module 2: Matrix Creation
Matrix creation is the most crucial part of a big data working solution. This module creates a matrix out of the labeled data
from the previous module and feeds it into the main for prediction after the training of the algorithm is complete.
Module 3 : Training
The data from the complete dataset is divided into parts for training and testing. The bigger chunk of data is often used for
training the algorithm. Hence, this module of the project is concerned with training the algorithm or better functionality.
Module 4 : Testing
After the training of the algorithm, it is necessary to test the data on the remaining portion of the dataset that was saved
for testing. Hence, this module is designed for testing the trained algorithm to check if it gives the desired results.
Module 5 : Prediction
This module is concerned about prediction or final outcome of the entire project. It creates a classification report and lists
out the spam tweets and account IDs.
9. OUTPUTS AND RESULTS
The parameters used can vary greatly in the range of values that the algorithm can use. Hence it is important for them to
be reduced to a common range of values called scaling to be implemented throughout. The default range for Cobweb
clustering in 0.5. However it does not yield the correct results in all cases and needs to be changed accordingly.
Experimentally, different values of scaling were used against different numbers of inputs and the results were compiled
into a table, as given below in Fig-2.
Fig -2: Table with scaling values
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 733
The column heads contain the number of inputs during every iteration during the testing. First row heads contain the
scaling values used for every iteration.
The cells contain 2 values, the accuracy and time taken, in that order, for the corresponding values of scaling and number
of inputs.
For a particular number of inputs, the accuracy increases as the value of scaling increases. It stops increasing after scaling
settles down on a particular value. Similarly, time taken increases as the scaling value increases. Evidently, it is a
logarithmic function, the relation of scaling(s) and number of inputs(n). It is, therefore, optimal to use the value which
gives the maximum accuracy but takes the minimum time.
It is found to be:
(1)
In addition, the other parts of the results obtained can be divided into two sections namely graphic and declarative.
The graphic part of the output depicts the spam to not spam ratio, mostly depending on the amount of training given to the
data. This also brings statistical data to the output which can be used for comparison as shown in Fig-3.
Fig -3: Graphical Output
The plain declarative part of the output depicts the tweet id and in text, whether the tweet is spam or not, as depicted in
Figure 4. This is a rather tedious output to read due to the high amount of tweets in the data set.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 734
Fig -4: Declarative Output
10. CONCLUSION
Upon running the hybrid model for spam detection on the testing data, the desired output of tweet id and the declaration
of spam or not spam is achieved. The acquired value of f measure, which is dependent on the spam to non-spam ratio, is
found to be at par with the existing models. Hence, it is safe to conclude, this is an alternate yet successful hybrid model to
detect spam on twitter.
11. REFERENCES
[1] A Novel Stream Clustering Framework for Spam Detection in Twitter; Hadi Tajalizadeh and Reza Boostani; IEEE
TRANSACTIONS ONCOMPUTATIONAL SOCIAL SYSTEMS, VOL. 6, NO. 3, JUNE 2019
[2] A Topic-Based Hidden Markov Model for Real-Time Spam Tweets Filtering; Mahdi Washha, Aziz Qaroush, Manel
Mezghani, Florence Sedes; 21st International Conference on Knowledge Based and Intelligent Information and
Engineering Systems, KES2017, 6-8 September 2017, Marseille, France
[3] Spam Detection Techniques for Twitter; Rutuja Katpatal, Aparna Junnarkar; International Research Journal of
Engineering and Technology (IRJET), Volume 05, May 2018.
[4] Design of Machine Learning Approach For Spam Tweet Detection; Miss. Shukla Twinkle Kailas, Prof.D.B.Kshirsagar;
IJARIIE-ISSN (O), Volume 3, May 2018.
[5] Fast Asynchronous Anti-TrustRank forWeb Spam Detection; Joyce Jiyoung Whang, Yeon Seong Jeong,Inderjit S. Dhillon,
Seonggoo Kang, Jungmin Lee; MIS2, Marina Del Rey, CA, USA, Volume 02, Issue 5, April 2016.
[6] O. Kurasova, V. Marcinkevicius, V. Medvedev, A. Rapecka, and P. Stefanovic, “Strategies for big data clustering,” in Proc.
IEEE 26th Int. Conf. Tools Artif. Intell., Nov. 2014, pp. 740–747.
[7] Twitter spam detection; Shradha Hirve,Swarupa Kamble ; IJESC,49-57, April 2017.
[8] Constraine d NMF-base d semi-supervise d learning for social me dia spammer detection ;Dingguo Yu, Nan Chen, Frank
Jiang,Bin Fu, Aihong Qin; Elsevier-2018.
[9] Early filtering of ephemeral malicious accounts on Twitter; Sangho Lee, Jong Kim; Elsevier, Issue 48-57, 2015.
[10] Twitter Spam detection; Shradha Hirve,Swarupa Kamle; IJESC-2016.
[11] Twitter turns 7 : User sends over 400 million tweets everyday; H. Tsukayama; Washington Post(2013)
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 735
[12] Twitter spammer detection using data stream clustering ; Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. H. Wang,
April-2017.
[13] Suspended accounts in retrospect: An analysis of Twitter spam;K. Thomas, C. Grier, D. Song, and V. Paxson(2013).
[14] Detecting Spammers on social media; G. Stringhini, C. Kruegel, and G. Vigna(2015)
[15] An in-depth analysis of abuse on Twitter; J. Oliver, P. Pajares, C. Ke, C. Chen, and Y. Xiang,
[16] Semi-supervised spam detection in Twitter stream; S. Sedhai and A. Sun; IEEE Trans. Comput. Social Syst., vol. 5, no. 1,
2018
[17] A performance evaluation of machine learning-based streaming spam tweets detection; C Chen el at.; IEEE Trans.
Comput. Social Syst., vol. 2, no. 3, 2015
[18] Twitter spam detection based on deep learning; T. Wu, S. Liu, J. Zhang, and Y. Xiang, Elsevier-2017
[19] Statistical Features-Based Real-Time Detection of Drifted Twitter Spam; Chao Chen, Yu Wang, Jun Zhang, Yang Xiang,
Wanlei Zhou,Geyong Min; IEEE Ttransaction on Information Forensics and Security, VOL. 12, NO. 4, APRIL 2017
[20] Detecting spammer on twitter; F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida; Proc. 7th Annu. Collaboration,
Electron. Messaging, Anti-Abuse Spam Conf, 2010
[21] What is twitter, a social network or a news media; H. Kwak, C. Lee, H. Park, and S. Moon; Proc. 19th Int. Conf. World
Wide Web, 2010
[22] A Survey of Spam Detection Methods on Twitter; Abdullah Talha Kabakus and Resul Kara, International Journal of
Advanced Computer Science and Applications, Vol. 8, No. 3, 2017
[23] Investigating the deceptive information in Twitter spam; Chao Chen et al. ; Future Generation Computer Systems
(2017)
[24] WarningBird: A Near Real-Time Detection System for Suspicious URLs in Twitter Stream; Sangho Lee and Jong Kim;
IEEE transactions on dependable and secure computing - 2016
[25] Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter; Xiaomo Liu et al.;
25th ACM CIKM. 20
AUTHORS
Vidhi Tiwari is a fourth-year student currently
pursuing her bachelor’s degree in Computer
Science Engineering from SRM Institute of Science
and Technology. She has completed courses like
Web development in Php and Data Science in R.
Avid reader and learner, she is also interested in
writing.
Akarsh Srivastava is a fourth-year student
currently pursuing his bachelor’s degree in
COmputer Science from SRM Institute of Science
and Technology. An active participant in the
college club “Humanoid”, he has handled the
coding department of the same, leading him to
take part in hackathons. He has completed
courses such as Machine learning and is
interested in deep learning and neural nets.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 736
Mrs. P. Akhilandeswari holds the position of
Assistant Professor at SRM Institute of Science
and Technology in the department of Computer
Science till date. She has had over 15 years of
experience. Her interests in computing lie in IOT,
data science and big data, and cloud computing.

More Related Content

PDF
IRJET - Fake News Detection using Machine Learning
PDF
Graph-based Analysis and Opinion Mining in Social Network
PDF
Fake News Detection using Machine Learning
PDF
IRJET- Fake News Detection
PDF
Event detection and summarization based on social networks and semantic query...
PPTX
Using content and interactions for discovering communities in
PDF
IMPLIMENTATION ON DISTRIBUTED COOPERATIVE CACHING IN SOCIAL WIRELESS NETWORK ...
IRJET - Fake News Detection using Machine Learning
Graph-based Analysis and Opinion Mining in Social Network
Fake News Detection using Machine Learning
IRJET- Fake News Detection
Event detection and summarization based on social networks and semantic query...
Using content and interactions for discovering communities in
IMPLIMENTATION ON DISTRIBUTED COOPERATIVE CACHING IN SOCIAL WIRELESS NETWORK ...

What's hot (20)

PDF
Mcs 021
PDF
Study, analysis and formulation of a new method for integrity protection of d...
PDF
WARRANTS GENERATIONS USING A LANGUAGE MODEL AND A MULTI-AGENT SYSTEM
PDF
Link prediction 방법의 개념 및 활용
PPTX
Link prediction with the linkpred tool
PDF
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
PDF
Sub-Graph Finding Information over Nebula Networks
PPTX
PDF
Survey on Location Based Recommendation System Using POI
PPTX
36x48_new_modelling_cloud_infrastructure
PDF
IRJET - Suicidal Text Detection using Machine Learning
PDF
Top k query processing and malicious node identification based on node groupi...
PDF
PDF
Generating synthetic online social network graph data and topologies
PDF
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
PPTX
Twitter text mining using sas
PDF
IRJET - Cyberbulling Detection Model
PDF
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
PDF
SAS Text Mining
PDF
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
Mcs 021
Study, analysis and formulation of a new method for integrity protection of d...
WARRANTS GENERATIONS USING A LANGUAGE MODEL AND A MULTI-AGENT SYSTEM
Link prediction 방법의 개념 및 활용
Link prediction with the linkpred tool
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
Sub-Graph Finding Information over Nebula Networks
Survey on Location Based Recommendation System Using POI
36x48_new_modelling_cloud_infrastructure
IRJET - Suicidal Text Detection using Machine Learning
Top k query processing and malicious node identification based on node groupi...
Generating synthetic online social network graph data and topologies
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
Twitter text mining using sas
IRJET - Cyberbulling Detection Model
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
SAS Text Mining
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
Ad

Similar to IRJET - Twitter Spam Detection using Cobweb (20)

PDF
Analyzing Social media’s real data detection through Web content mining using...
PDF
Fake News Detection using Deep Learning
PDF
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
PDF
IRJET - Fake News Detection: A Survey
PDF
Discovering Influential User by Coupling Multiplex Heterogeneous OSN’S
PDF
IRJET- Machine Learning: Survey, Types and Challenges
PDF
Unification Algorithm in Hefty Iterative Multi-tier Classifiers for Gigantic ...
PDF
A DATA MINING APPROACH FOR FILTERING OUT SOCIAL SPAMMERS IN LARGE-SCALE TWITT...
PDF
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
PDF
Fake News Detection using Passive Aggressive and Naïve Bayes
PDF
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
PDF
A study of cyberbullying detection using Deep Learning and Machine Learning T...
PDF
A study of cyberbullying detection using Deep Learning and Machine Learning T...
PDF
IRJET- Cross System User Modeling and Personalization on the Social Web
PDF
Detecting root of the rumor in social network using GSSS
PDF
Feature Subset Selection for High Dimensional Data using Clustering Techniques
PDF
Detecting Malicious Bots in Social Media Accounts Using Machine Learning Tech...
PDF
MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATION
PDF
MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATION
PDF
Social Friend Overlying Communities Based on Social Network Context
Analyzing Social media’s real data detection through Web content mining using...
Fake News Detection using Deep Learning
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
IRJET - Fake News Detection: A Survey
Discovering Influential User by Coupling Multiplex Heterogeneous OSN’S
IRJET- Machine Learning: Survey, Types and Challenges
Unification Algorithm in Hefty Iterative Multi-tier Classifiers for Gigantic ...
A DATA MINING APPROACH FOR FILTERING OUT SOCIAL SPAMMERS IN LARGE-SCALE TWITT...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Fake News Detection using Passive Aggressive and Naïve Bayes
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
IRJET- Cross System User Modeling and Personalization on the Social Web
Detecting root of the rumor in social network using GSSS
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Detecting Malicious Bots in Social Media Accounts Using Machine Learning Tech...
MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATION
MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATION
Social Friend Overlying Communities Based on Social Network Context
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
Construction Project Organization Group 2.pptx
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
UNIT 4 Total Quality Management .pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Sustainable Sites - Green Building Construction
PDF
PPT on Performance Review to get promotions
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
bas. eng. economics group 4 presentation 1.pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Construction Project Organization Group 2.pptx
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
UNIT 4 Total Quality Management .pptx
573137875-Attendance-Management-System-original
UNIT-1 - COAL BASED THERMAL POWER PLANTS
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Internet of Things (IOT) - A guide to understanding
Sustainable Sites - Green Building Construction
PPT on Performance Review to get promotions
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Operating System & Kernel Study Guide-1 - converted.pdf
CYBER-CRIMES AND SECURITY A guide to understanding
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
bas. eng. economics group 4 presentation 1.pptx

IRJET - Twitter Spam Detection using Cobweb

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 728 Twitter Spam Detection using Cobweb Vidhi Tiwari1, Akarsh Srivastava2, Mrs. P. Akilandeswari3 1SRMIST, Department of Computer Science and Engineering, Kattankulathur Campus 2SRMIST, Department of Computer Science and Engineering, Kattankulathur Campus 3Assistant Professor, Department of Computer Science and Engineering, Kattankulathur Campus, SRM Institute of Science and Technology (formerly known as SRM University). ---------------------------------------------------------------------***---------------------------------------------------------------------- Abstract - With the advent of online social networks, or OSN, there has been a significant rise of users daily, with twitter being one of the largest social media networking sites in the world, for the world to have ever used. The network has been an ever- growing web of users who may or may not have malicious intent towards society. With the increase in the number of people using the OSN, there are more chances of exploitation. One of the major sources of exploitation is spam on the twitter feed. It is known by survey, that 1 in every 21 tweets is spam. There are multiple ways a tweet could be used in harmful ways as well as multiple ways a tweet can be made into a spam and hence, it is necessary to differentiate a tweet that is normal or spam. This project aims to classify spam using a data stream clustering technique called cobweb and a result enhancer namely gradient booster. Cobweb is an enhanced clustering algorithm that classifies data using trees. We aim to combine this to gradient boosting, which is also a tree-based classifier to create a more enhanced spam detection system than there already are. Key Words: Twitter; spam; cobweb; gradient boosting; URL; hashtag; phishing; mention; logarithmic function 1. INTRODUCTION Ever since the dawn of the time of online social networks, OSNs (for example Twitter, Facebook, Instagram), there have been new and upcoming users everyday. This growing network provides opportunities to an array of people using the network since multiple accounts from a user are possible. While many users have been using this commendable technology for greater use such as reaching out to help to employment opportunities, organizing fundraisers or reaching out to help causes, it is often used for malicious motives which, unfortunately has a lot of channels. Twitter has been one of the most leading social media networks since OSNs with a growth rate of the number of users being exponential. A survey conducted in 2018 showed that there are a total of 231 million users on Twitter. Out of these about 200 million log onto their accounts everyday. This gives a lot of ammunition to spammers. A statistical study showed that 1in every 21 tweets is spam. With a number this high, we could be duped potentially every time we are logged in. Spammers use multiple techniques to create spam tweets. The most popular method is the shortened URLs. Since the URLs’ destination isn’t known, spammers use it to deceive users. Clicking on the link would lead you to get malware. Another method used is the use of hashtags. Any word with the symbol of ‘’ in front of it can be used as a traffic generator. Tapping on it could lead to potential spam. ‘Mentions’ work the same way as hashtags. They both generate traffic on and off the platform. 2. CONCEPTS IMPLEMENTED 2.1 Cobweb Clustering Cobweb is a data stream clustering algorithm that was developed by Douglas H.Fisher. It is a system that works on the idea of conceptual clustering of the input data. The algorithm observes the data and creates a classification tree. Each node symbolizes a class and all the daughter nodes are expressed in the form of a probabilistic concept (PC) which is a probable summarization of all attributes of the nodes which is used as a label for the parent node (class). There are a few operations that cobweb executes to create the classification trees: 1) Merging of The Nodes: Merging the nodes requires replacement by a singular node whose children are the additive of the children of the primary node set. The child node should summarize the attribute-value distributions of all objects classified under them in the tree. 2) Splitting a given node: The node in reference is divided into two and replaced with its children nodes. 3) Inserting a new node: A node is created which represents the object that will be inserted into the tree. 4) Passing an object down the hierarchy of the tree: This step calls the Cobweb algorithm on the object and the subtree rooted in the node.
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 729 The basic algorithm for cobweb works by taking a node as an input for a record. If the base has no children, the node at hand is set as the child node and the category of the child’s features is added to the tree. However, if the case has children, the Category Utility is calculated for the given node and the best CU, that is, the smallest value leading class inherits the given node as a child. This process is repeated for all entries of the dataset. The main reason why a classification tree is used is to predict any missing attribute or recognize a new class of a new object and add it to the tree. 2.2 Gradient Boosting Gradient boosting is one of the machine learning techniques for any set of classification and regression problems. It gives, as result, a prediction model in the form of an assemblage of weaker prediction models, which is a method that uses multiple weak algorithms to train into a better working model, typically decision trees. It works on building the model in a stage-like fashion much like most of the boosting algorithms do, and it creates a more accurate system by allowing the optimization of an arbitrary differentiable LS to give a generalized result. In many supervised algorithms used as learning algorithms, the programmer has an output variable p and a vector of input variables namely q described using a joint probability distribution P(p,q). Using a training set (p,q). . . (pn,qn) of already known values of p and corresponding values of q, the goal is to find an approximate value to a function that gives a minimized value of the expected value of some specified LF. The gradient boosting algorithm assumes a real-valued q and looks for an approximation for the value of p, from the given set of values, in the form of a weighted sum of functions from some class, called base learners, also known as weak learners. 3. STATE OF ART Spam detection has had tremendous breakthroughs since the onset of everyday use of social media platforms, especially on twitter. In 2019, a novel stream clustering technique was designed which used incremental naive bayes classifier that was trained to work efficiently on micro-clusters. The INBs capture the boundary and mean values of the microclusters, whereas the Euclidean distance just utilizes the mean value of the clusters. This mostly gives skewed results for asymmetric big micro- clusters. In this paper, DenStream was promoted to, by the proposed framework, to be called INB-DenStream. To show the efficacy of INB-DenStream, common and known methods much like DenStream, CluStream and StreamKM++ were applied to Twitter datasets and their performances were noted for values such as purity, overall precision, overall recall, F1 measure, parametric sensitivity, and computational complexity. The results of the comparison showed the superiority of their method to the rivals in almost all the datasets. In 2017, Mahdi Washha,, Aziz Qaroush, Manel Mezghani and Florence Sedes formulated the Hidden Markov Model (HMM) to be a time dependent model for real-time filtering of spam tweets solely based on the topic of the tweets. This method is only functional on the easily available and openly accessible meta-data in the tweet object to detect spam tweets exiting from a stream of tweets related to a topic (e.g., #Trump), while considering the state of previously handled tweets associated with the same topic. Compared to the robust and well known classical time-independent classification method, the Random Forest algorithm, the experimental evaluation shows that the efficiency increases on the increase the quality of the topics in terms of overall precision, recalls, and F measure performance metrics. In 2018, Rutuja Katpatal, Aparna Junnarkar studied the various classifiers and performed and presented a comparative study of them all for research purposes. Anti Trust Rank algorithm is a known link-based spam detection algorithm which works on the principle that spam pages are more likely to be referenced by other spam pages. It may function on any possible platform. Since a real world connected web graph involves multiple billions of nodes, it is crucial to develop work efficient spam detection algorithms for making the process easier. Thus, in 2018, Yeon Seong Jeong, Joyce Jiyoung Whang, Inderjit S. Dhillon, Seung Goo Kang and Jungmin Lee developed the asynchronous Anti Trust Rank algorithm which allows us to reduce the number of arithmetic operations required to detect spam as compared to the traditional synchronous Anti-TrustRank algorithm by a large margin, without lowering the performance in detecting web spam. Spam detecting systems mainly build their classification architecture which includes the binary classification of their data and then it can be solved by a chosen machine learning algorithm that suits best to give the best possible results. The ML algorithms, for example, Naïve Bayes classifier or support vector machine classifier then report the behavior of these designed models. Hence, in 2016, Miss Shukla, Twinkle Kailas and Prof.D.B. Kshirsagar showed the effect of the data related factors, namely the spam to non-spam ratio, the training data size, and data sampling, to the detection performance of the system. The feature of the shown system was time varying spam tweet detection. The System showed that spam detection was a huge challenge and it bridges the space between the performance evaluation. It mainly focused on the data
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 730 that was used for training and testing the system, the features and the model to identify the genuine, legitimate user and report the spam user by providing the answer in a binary value, that is, 1 or 0. 4. INFERENCE FROM SURVEY Inference from the survey was done for over multiple journal papers as well as conference papers for a clear understanding of the subject. The basic idea of spam on twitter was gathered by papers written by Aparna Junnarkar, Rutuja Katpatal, Aparna Junnarkar [3]; H. Tsukayama [11]; D. Song, K. Thomas,V. Paxson and C. Grier [13]; G. Stringhini,G. Vigna and C. Kruegel [14]; Chao Chen [23]. The tested methods were understood and the results were analyzed. This gave an average measure of values as results that we use to compare our results with. This provides a standard to our work, given this is a system that is being built to provide an efficient alternate method to detect spam on twitter. 5. DATA SET ANALYSIS AND STUDY With the use of machine learning algorithms: cobweb and gradient boosting, there is a need to train the algorithm before we test the system for detecting spam. Hence, there are two sets of data sets that are used. 5.1 Training Data set This system uses a training data set which is used to help increase the accuracy of the algorithm used. The accuracy of the performance of the algorithm depends solely on the amount of training data used and the complexity of the data set, that is, a good mix of data in the data set. Programmers divide the data set that is acquired into two uneven parts. The greater, three-fourth is used for training while the smaller division is used for the testing. That is done when not a lot of data can be gathered. This project allowed us to gather enough tweets to have multiple data sets. Table -1: Training data set attributes 5.2 Testing Data Set The testing data set is similar to the training data set but smaller in data entries as it is only used for testing the accuracy of the system. In this case, it is missing an attribute for spam or not spam as that is the test, but at times the testing and training data set is the same. Table -2: Testing data set attributes Attribute Data Type 1 Tweet ID Integer 2 Tweet Text 3 Followers Integer Attribute Data Type 1 Tweet ID Integer 2 Tweet Text 3 Followers Integer 4 Following Integer 5 Is_retweet Binary 6 Location Text 7 Spam_or_notspam Text (Spam/Not Spam)
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 731 4 Following Integer 5 Is_retweet Binary 6 Location Text 6. Methodology Used With this project, we propose to attempt to improve the method with which spam on the social media platform, Twitter, is detected. We aim to implement cobweb, which is an enhanced conceptual clustering algorithm, that classifies data streams using trees in combination with gradient boosting, which would be used to enhance the produced clustered structure fed into the boosting technique to enhance the output accuracy. This hybrid model works like a typical flow chart model where the cobweb technique would first create the tree by creating classes and adding nodes to a class according to it’s Category Utility(CU). This is repeated for the length of the data set and this tree is then fed into the gradient booster for enhancement of the classification. Gradient boosting creates trees in a stage-wise manner using the weaker associations and trains the algorithm repeatedly to improve the classification model. 7. Architecture Diagram Fig -1: Architecture diagram representing the flow of the algorithm 8. MODULES Module 1: Feature Extraction Feature extraction is the method of reducing dimensions of the data at hand. This module will single out every part of the tweet and label them separately. For example, all the characters after @ would be stored under a single username and labeled. Some other features of the tweet include hashtags, links in the tweet, mentions, etc.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 732 Module 2: Matrix Creation Matrix creation is the most crucial part of a big data working solution. This module creates a matrix out of the labeled data from the previous module and feeds it into the main for prediction after the training of the algorithm is complete. Module 3 : Training The data from the complete dataset is divided into parts for training and testing. The bigger chunk of data is often used for training the algorithm. Hence, this module of the project is concerned with training the algorithm or better functionality. Module 4 : Testing After the training of the algorithm, it is necessary to test the data on the remaining portion of the dataset that was saved for testing. Hence, this module is designed for testing the trained algorithm to check if it gives the desired results. Module 5 : Prediction This module is concerned about prediction or final outcome of the entire project. It creates a classification report and lists out the spam tweets and account IDs. 9. OUTPUTS AND RESULTS The parameters used can vary greatly in the range of values that the algorithm can use. Hence it is important for them to be reduced to a common range of values called scaling to be implemented throughout. The default range for Cobweb clustering in 0.5. However it does not yield the correct results in all cases and needs to be changed accordingly. Experimentally, different values of scaling were used against different numbers of inputs and the results were compiled into a table, as given below in Fig-2. Fig -2: Table with scaling values
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 733 The column heads contain the number of inputs during every iteration during the testing. First row heads contain the scaling values used for every iteration. The cells contain 2 values, the accuracy and time taken, in that order, for the corresponding values of scaling and number of inputs. For a particular number of inputs, the accuracy increases as the value of scaling increases. It stops increasing after scaling settles down on a particular value. Similarly, time taken increases as the scaling value increases. Evidently, it is a logarithmic function, the relation of scaling(s) and number of inputs(n). It is, therefore, optimal to use the value which gives the maximum accuracy but takes the minimum time. It is found to be: (1) In addition, the other parts of the results obtained can be divided into two sections namely graphic and declarative. The graphic part of the output depicts the spam to not spam ratio, mostly depending on the amount of training given to the data. This also brings statistical data to the output which can be used for comparison as shown in Fig-3. Fig -3: Graphical Output The plain declarative part of the output depicts the tweet id and in text, whether the tweet is spam or not, as depicted in Figure 4. This is a rather tedious output to read due to the high amount of tweets in the data set.
  • 7. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 734 Fig -4: Declarative Output 10. CONCLUSION Upon running the hybrid model for spam detection on the testing data, the desired output of tweet id and the declaration of spam or not spam is achieved. The acquired value of f measure, which is dependent on the spam to non-spam ratio, is found to be at par with the existing models. Hence, it is safe to conclude, this is an alternate yet successful hybrid model to detect spam on twitter. 11. REFERENCES [1] A Novel Stream Clustering Framework for Spam Detection in Twitter; Hadi Tajalizadeh and Reza Boostani; IEEE TRANSACTIONS ONCOMPUTATIONAL SOCIAL SYSTEMS, VOL. 6, NO. 3, JUNE 2019 [2] A Topic-Based Hidden Markov Model for Real-Time Spam Tweets Filtering; Mahdi Washha, Aziz Qaroush, Manel Mezghani, Florence Sedes; 21st International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2017, 6-8 September 2017, Marseille, France [3] Spam Detection Techniques for Twitter; Rutuja Katpatal, Aparna Junnarkar; International Research Journal of Engineering and Technology (IRJET), Volume 05, May 2018. [4] Design of Machine Learning Approach For Spam Tweet Detection; Miss. Shukla Twinkle Kailas, Prof.D.B.Kshirsagar; IJARIIE-ISSN (O), Volume 3, May 2018. [5] Fast Asynchronous Anti-TrustRank forWeb Spam Detection; Joyce Jiyoung Whang, Yeon Seong Jeong,Inderjit S. Dhillon, Seonggoo Kang, Jungmin Lee; MIS2, Marina Del Rey, CA, USA, Volume 02, Issue 5, April 2016. [6] O. Kurasova, V. Marcinkevicius, V. Medvedev, A. Rapecka, and P. Stefanovic, “Strategies for big data clustering,” in Proc. IEEE 26th Int. Conf. Tools Artif. Intell., Nov. 2014, pp. 740–747. [7] Twitter spam detection; Shradha Hirve,Swarupa Kamble ; IJESC,49-57, April 2017. [8] Constraine d NMF-base d semi-supervise d learning for social me dia spammer detection ;Dingguo Yu, Nan Chen, Frank Jiang,Bin Fu, Aihong Qin; Elsevier-2018. [9] Early filtering of ephemeral malicious accounts on Twitter; Sangho Lee, Jong Kim; Elsevier, Issue 48-57, 2015. [10] Twitter Spam detection; Shradha Hirve,Swarupa Kamle; IJESC-2016. [11] Twitter turns 7 : User sends over 400 million tweets everyday; H. Tsukayama; Washington Post(2013)
  • 8. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 735 [12] Twitter spammer detection using data stream clustering ; Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. H. Wang, April-2017. [13] Suspended accounts in retrospect: An analysis of Twitter spam;K. Thomas, C. Grier, D. Song, and V. Paxson(2013). [14] Detecting Spammers on social media; G. Stringhini, C. Kruegel, and G. Vigna(2015) [15] An in-depth analysis of abuse on Twitter; J. Oliver, P. Pajares, C. Ke, C. Chen, and Y. Xiang, [16] Semi-supervised spam detection in Twitter stream; S. Sedhai and A. Sun; IEEE Trans. Comput. Social Syst., vol. 5, no. 1, 2018 [17] A performance evaluation of machine learning-based streaming spam tweets detection; C Chen el at.; IEEE Trans. Comput. Social Syst., vol. 2, no. 3, 2015 [18] Twitter spam detection based on deep learning; T. Wu, S. Liu, J. Zhang, and Y. Xiang, Elsevier-2017 [19] Statistical Features-Based Real-Time Detection of Drifted Twitter Spam; Chao Chen, Yu Wang, Jun Zhang, Yang Xiang, Wanlei Zhou,Geyong Min; IEEE Ttransaction on Information Forensics and Security, VOL. 12, NO. 4, APRIL 2017 [20] Detecting spammer on twitter; F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida; Proc. 7th Annu. Collaboration, Electron. Messaging, Anti-Abuse Spam Conf, 2010 [21] What is twitter, a social network or a news media; H. Kwak, C. Lee, H. Park, and S. Moon; Proc. 19th Int. Conf. World Wide Web, 2010 [22] A Survey of Spam Detection Methods on Twitter; Abdullah Talha Kabakus and Resul Kara, International Journal of Advanced Computer Science and Applications, Vol. 8, No. 3, 2017 [23] Investigating the deceptive information in Twitter spam; Chao Chen et al. ; Future Generation Computer Systems (2017) [24] WarningBird: A Near Real-Time Detection System for Suspicious URLs in Twitter Stream; Sangho Lee and Jong Kim; IEEE transactions on dependable and secure computing - 2016 [25] Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter; Xiaomo Liu et al.; 25th ACM CIKM. 20 AUTHORS Vidhi Tiwari is a fourth-year student currently pursuing her bachelor’s degree in Computer Science Engineering from SRM Institute of Science and Technology. She has completed courses like Web development in Php and Data Science in R. Avid reader and learner, she is also interested in writing. Akarsh Srivastava is a fourth-year student currently pursuing his bachelor’s degree in COmputer Science from SRM Institute of Science and Technology. An active participant in the college club “Humanoid”, he has handled the coding department of the same, leading him to take part in hackathons. He has completed courses such as Machine learning and is interested in deep learning and neural nets.
  • 9. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 736 Mrs. P. Akhilandeswari holds the position of Assistant Professor at SRM Institute of Science and Technology in the department of Computer Science till date. She has had over 15 years of experience. Her interests in computing lie in IOT, data science and big data, and cloud computing.