Sub-Topic Detection Of Tweets
Related To An Entity
International Institute of Information Technology-Hyderabad
Mentor - Sandeep Pannem
By
P Yashaswi (201102111), Aayush Asawa (201305617),
Kumari Ankita (201101161), Diksha J. Yadav (201125130)
Introduction
➢ Tweets are classified according to the “Topic” and then the “Subtopic” they
refer to.
○ “Topic” refers to any major event in the real world.
○ “Subtopics” are fine-grained aspects of such events.
➢ Mining the subtopics of entities/topics from tweets helps in trend
analysis, social monitoring, topic tracking and reputation
mining.
➢ Generally, all tweets related to a particular entity have similar keywords, so
detecting the subtopics requires dealing with more features than keywords alone.
Work Flow
[Workflow diagram]
Training Data → Store features in Lucene → Classifier (Phase 1, 2, 3) → Detected Subtopic
Input Tweet → Extract Tweet features → Classifier (Phase 1, 2, 3)
Approach
Input: A training set of tweets that have subtopic names as class labels,
and test tweets that are to be classified into subtopics.
Output: A subtopic assigned to each of the test tweets.
The entire workflow can be broken into three phases:
1. Pre-processing
2. Feature Extraction and Representation
3. Classification
Feature Extraction
The following features are extracted from each tweet:
➢ Tweet concepts (using the TagMe API)
➢ Named entities and event phrases (using TwiCal)
➢ URL concepts (using the TagMe API on the content of the external links)
➢ Key phrases (extracting noun phrases after POS tagging)
➢ Hashtags
➢ Categories (extracting categories for the titles obtained through TagMe)
Similarity measures used:
➢ Wikipedia Miner (for comparing Wikipedia titles)
➢ WordNet similarity measure (for comparing key phrases)
Classification
➢ Subtopic detection is treated as a classification problem in which
subtopics are the class labels and tweets are the data points.
➢ The classifier learns which features the majority of the tweets
(data points) of a particular subtopic (class label) have.
➢ Based on the features, initial seed clusters are created for each topic, and
each cluster is represented as crisp information and an index.
➢ The features of each test tweet are extracted and compared with the clusters,
and the cluster that best matches is assigned to the test tweet.
➢ This is done using a machine learning technique.
Pre-Processing
Pre-processing involves the following steps:
➢ Removing stopwords from the tweets and stemming the training
data points.
➢ Extracting URLs from the tweets.
This is done for both training and test tweets.
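The pre-processing steps above can be sketched roughly as follows. This is only an illustration, not the authors' code: the stopword list, the `naive_stem` helper (a crude stand-in for a real stemmer such as Porter's) and the `preprocess` function name are all assumptions made for the example.

```python
import re

# Illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and", "for"}
URL_RE = re.compile(r"https?://\S+")


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Return (stemmed tokens without stopwords, URLs found in the tweet)."""
    urls = URL_RE.findall(tweet)              # pull out external links first
    text = URL_RE.sub(" ", tweet).lower()     # then drop them from the text
    tokens = re.findall(r"[#@]?\w+", text)    # keep hashtags and mentions whole
    kept = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return kept, urls
```

The URLs are returned separately because the linked content is later used as a feature source (the URL concepts).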
Algorithm
Offline Process
1. All the tweets in the training data are grouped together according to their
subtopic.
2. For every tweet in a subtopic, the features are extracted and grouped to
form the subtopic features.
3. The subtopic features of all the subtopics are stored in the Lucene index
under different fields.
4. Features that are common to two or more subtopics are removed, as are
features that are directly related to the entity name.
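A minimal sketch of the offline process, under stated assumptions: a plain Python dictionary stands in for the Lucene index, features are simple strings, and the pruning of step 4 is applied before the result is returned rather than after storing. The function name `build_subtopic_features` and the `entity_terms` parameter are hypothetical.

```python
from collections import defaultdict


def build_subtopic_features(training, entity_terms):
    """training: iterable of (subtopic_label, features) pairs, one per tweet.

    Returns {subtopic: set of features unique to that subtopic and not
    related to the entity name}. A dict replaces the Lucene index here.
    """
    grouped = defaultdict(set)
    for subtopic, features in training:       # steps 1-2: group tweets, merge features
        grouped[subtopic] |= set(features)

    counts = defaultdict(int)                 # number of subtopics containing each feature
    for features in grouped.values():
        for feature in features:
            counts[feature] += 1

    # step 4: drop features shared by two or more subtopics and entity-related ones
    return {
        subtopic: {f for f in features if counts[f] == 1 and f not in entity_terms}
        for subtopic, features in grouped.items()
    }
```

Removing shared and entity-related features leaves only the features that discriminate between subtopics, which matters because tweets about one entity tend to share most of their keywords.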
Algorithm
Online Procedure
1. Phase 1: The category features of the test tweet are searched in the Lucene
index and the top 10 subtopics are listed.
2. Phase 2: The tweet concepts and URL concepts of the test tweet are compared
with those of the top 10 subtopics from Phase 1, and the top 5 subtopics are
listed based on the Wikipedia Miner similarity measure.
3. Phase 3: Named entities, key phrases and event phrases are compared with those
of the top 5 subtopics from Phase 2 using WordNet similarity measures. For hashtags,
a direct intersection is done. After this, the best of the 5 subtopics is chosen.
All of these comparisons can also be combined to get the best subtopic.
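The three-phase narrowing (top 10, then top 5, then the best match) can be sketched as below. This is an illustration only: Jaccard set overlap is swapped in for the Lucene search, the Wikipedia Miner measure and the WordNet measure, and the field names (`categories`, `concepts`, `phrases`, `hashtags`) and function names are assumptions.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; a placeholder for the real similarity measures."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(candidates, score, k):
    """Return the k best-scoring candidates, best first."""
    return sorted(candidates, key=score, reverse=True)[:k]


def classify(tweet, index):
    """tweet and each index[subtopic] map a field name to a set of strings."""
    # Phase 1: category match, keep the top 10 subtopics
    phase1 = top_k(index, lambda s: jaccard(tweet["categories"], index[s]["categories"]), 10)
    # Phase 2: tweet and URL concepts, keep the top 5
    phase2 = top_k(phase1, lambda s: jaccard(tweet["concepts"], index[s]["concepts"]), 5)

    # Phase 3: phrase similarity plus direct hashtag intersection, keep the best
    def phase3_score(s):
        shared_tags = set(tweet["hashtags"]) & set(index[s]["hashtags"])
        return jaccard(tweet["phrases"], index[s]["phrases"]) + len(shared_tags)

    return max(phase2, key=phase3_score) if phase2 else None
```

Filtering in stages keeps the more expensive comparisons of the later phases limited to a handful of candidate subtopics.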
Experiments
➢ The RepLab 2013 data set was used. The dataset contains tweets for 61 entities.
Each entity has about 700 tweets for training and 1500 tweets for testing.
➢ For evaluation we use Reliability, Sensitivity and F-measure.
The results that we got for the entity “Volvo” are:
Sensitivity: 0.37, Reliability: 0.39, F-measure: 0.38
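As a quick check of the reported numbers, assuming the F-measure is the balanced harmonic mean of Reliability and Sensitivity (the slides do not state the formula, so this is an assumption), the Volvo scores are consistent with each other:

```python
def f_measure(reliability, sensitivity):
    """Harmonic mean of reliability and sensitivity (assumed definition)."""
    if reliability + sensitivity == 0:
        return 0.0
    return 2 * reliability * sensitivity / (reliability + sensitivity)


# Volvo: 2 * 0.39 * 0.37 / (0.39 + 0.37) = 0.2886 / 0.76, roughly 0.3797
print(round(f_measure(0.39, 0.37), 2))  # prints 0.38
```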
Future Work
➢ We can build an SVM classifier that determines which
feature should be given preference while classifying the tweets.
➢ The input vectors would have the various features of the various
subtopics as dimensions, with the corresponding similarity measures as the
coefficients, and the labelled subtopic as the class label.
➢ In the testing phase we can create similar vectors for the test tweets to get their
corresponding subtopics.
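One possible reading of the proposed input vectors is sketched below: one dimension per (subtopic, feature type) pair, whose value is the similarity between the tweet and that subtopic on that feature type. Jaccard overlap again stands in for the actual similarity measures, and the field names and `to_vector` are hypothetical. Training an SVM on these vectors (for example with scikit-learn's `SVC`) is left out of the sketch.

```python
# Assumed feature types; one vector dimension per (subtopic, field) pair.
FIELDS = ("categories", "concepts", "phrases", "hashtags")


def overlap(a, b):
    """Jaccard overlap, a placeholder for the real similarity measures."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def to_vector(tweet, index):
    """Similarity of the tweet to every subtopic on every feature type.

    Subtopics are visited in sorted order so that training and test
    vectors share the same dimension layout.
    """
    return [
        overlap(tweet[field], index[subtopic][field])
        for subtopic in sorted(index)
        for field in FIELDS
    ]
```

With such vectors the classifier, rather than a fixed phase order, would learn how much weight each feature type deserves.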
References
1. REINA at RepLab2013 Topic Detection Task: Community Detection
2. Entity Tracking in Real-Time using Sub-Topic Detection on Twitter
