Semantic Filtering
An example of Semantic technologies for real-time analysis
Pavan Kapanipathi
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University, USA
Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
Streams are everywhere
Social Data
Text
Images
Videos
Sensor Data
Streams
Information Overload
500M users generate 500M tweets per day
3
It’s not information overload.
It’s filter failure
-- Clay Shirky
Each of our projects faces Information Overload
• Disaster Management
• Hazards SEES
• Healthcare Issues
• Depression
• Societal Issues
• Edrug Trends
• Harassment
Semantic Filtering
• Filtering is necessary
• Understanding the requirements and utilizing semantics for filtering is important
Two Main Topics
• Twarql
• Streaming annotation and flexible querying on Twitter
• Continuous Semantics
• Tracking dynamic topics on Twitter
Twarql
Tracking the health care debate in the United States on Social Media
[Figure labels: Health Care Reform; Healthcare reform legislation in the United States; Patient Protection and Affordable Care Act (Obamacare)]
Twarql
Extraction Pipeline - Tweet
I think it’s good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF)
Dbpedia:Ipad
Dbpedia:Tablet
URLs
http://penguinkang.com/tweetprobe/
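The annotation step on this slide can be pictured with a small sketch. It is only an illustration: it assumes a tiny hand-made dictionary (`GAZETTEER`, hypothetical) in place of a real entity spotter and reuses the `Dbpedia:` labels shown above. The function name `annotate` and the URL pattern are likewise assumptions, not Twarql's actual implementation.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> label as written on the slide.
GAZETTEER = {"ipad": "Dbpedia:Ipad", "tablet": "Dbpedia:Tablet"}

# Simple URL pattern; a production pipeline would need something more robust.
URL_PATTERN = re.compile(r"https?://\S+")


def annotate(tweet):
    """Return the entity labels and URLs found in a tweet's text."""
    urls = URL_PATTERN.findall(tweet)
    # Remove URLs before tokenizing so their parts are not matched as entities.
    text = URL_PATTERN.sub(" ", tweet).lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    entities = sorted(label for form, label in GAZETTEER.items() if form in tokens)
    return {"entities": entities, "urls": urls}


result = annotate("Good deal Apple Ipad Tablet case for ipad http://example.org/x")
```

The resulting dictionary is the kind of record that could then be serialized as RDF using the vocabularies listed on the next slide.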
RDF
• RDF Annotation
• Common RDF/OWL Data formats.
• FOAF, SIOC, OPO, MOAT
: Health_care_reform
Twarql – Use Case
Demo
http://knoesis.wright.edu/library/tools/twarql/demo.swf
Continuous Semantics
13
Dynamic Topics
Continuously Evolving on Twitter
Entity – Event relevance changes
Many entities are involved
14
Dynamic Topics
Manually crawl using keywords
“indianelection” “jan25” “sandy” “swineflu” “ebola”
15
Dynamic Topics
Manually updating keywords to get topic relevant tweets is not feasible
“indianelection”: “modi”, “bjp”, “congress”
“jan25”: “egypt”, “tunisia”, “arabspring”
“sandy”: “newyork”, “redcross”, “fema”
“swineflu” “ebola”
16
Problem
How can we automatically update the filters to track a dynamically evolving topic on Twitter?
17
Hashtags as Filters
• Identify a topic on Twitter
• Tweets with hashtags are more informative
• Users have a lot of freedom to create them
• Some get popular, most die
18
Exploring Hashtags as Evolving Filters for Dynamic Topics
HASHTAG FILTERS
                  Colorado Shooting (CS)   Occupy Wall Street (OWS)
Tweets            122,062                  6,077,378
Tags              192,512                  15,963,209
Distinct          12,350                   191,602
100% Retrieval    7,763                    21,314
19
Hashtag distributions
Top 1% of the hashtags retrieves around 85% of the tweets
20
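A coverage figure of this kind can be computed with a few lines of code. The sketch below is an assumption about how such a number might be measured, not the authors' script: each tweet is represented as a list of its hashtags, and `coverage_of_top_hashtags` (a hypothetical name) reports the share of tweets that contain at least one of the most frequent hashtags.

```python
from collections import Counter
from math import ceil


def coverage_of_top_hashtags(tweets, top_fraction):
    """Share of tweets containing at least one of the top `top_fraction` hashtags."""
    if not tweets:
        return 0.0
    # Count each hashtag once per tweet.
    counts = Counter(tag for tags in tweets for tag in set(tags))
    if not counts:
        return 0.0
    k = max(1, ceil(top_fraction * len(counts)))
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in tweets if top & set(tags))
    return covered / len(tweets)


sample = [["a", "b"], ["a"], ["a", "c"], ["d"]]
share = coverage_of_top_hashtags(sample, 0.25)  # top 1 of 4 distinct hashtags
```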
Hashtag Filters Co-occurrence Graph
[Figure: co-occurrence graphs for Colorado Shooting and Occupy Wall Street]
Event Related Hashtags co-occur with each other
21
Summarizing Hashtag Analysis
Starting with one of the event relevant hashtags, by co-occurrence we can reach other relevant hashtags
22
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Too many co-occurring hashtags
23
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring: Threshold δ
Preferably a prominent hashtag
24
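The thresholded co-occurrence step can be sketched as follows. The slides do not say what quantity δ is applied to, so this sketch assumes it is the share of the seed hashtag's tweets in which another hashtag also appears. The function name `cooccurring_hashtags` and the sample data are hypothetical.

```python
from collections import Counter


def cooccurring_hashtags(tweets, seed, delta):
    """Hashtags appearing in at least a `delta` share of the tweets that contain `seed`."""
    seed_tweets = [set(tags) for tags in tweets if seed in tags]
    if not seed_tweets:
        return {}
    counts = Counter(tag for tags in seed_tweets for tag in tags if tag != seed)
    total = len(seed_tweets)
    return {tag: n / total for tag, n in counts.items() if n / total >= delta}


sample_tweets = [
    ["#sandy", "#nyc"],
    ["#sandy", "#nyc", "#fema"],
    ["#sandy"],
    ["#other", "#nyc"],
]
candidates = cooccurring_hashtags(sample_tweets, "#sandy", 0.5)
```

As the following slides argue, candidates found this way still need a relevance check, because frequent co-occurrence alone also admits noisy hashtags.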
Does hashtag co-occurrence work?
o No. Co-occurrence alone does not work
o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of the top co-occurring hashtag with the dynamic topic
25
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold δ
Latest K (200,500)
Entity Extraction and Scoring: Normalized Frequency Scoring
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
26
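Building the entity vector of a hashtag could look like the sketch below. It assumes entity extraction has already produced a list of entity names per tweet, and it assumes "normalized frequency" means the share of the latest K tweets that mention the entity. That reading fits scores such as 0.9 and 0.7, which do not sum to 1, but the slides do not state the exact normalization. `hashtag_entity_vector` is a hypothetical name.

```python
from collections import Counter


def hashtag_entity_vector(tweet_entities, k=200):
    """Score each entity by the share of the latest k tweets that mention it."""
    window = tweet_entities[-k:]  # tweets are assumed to be ordered oldest to newest
    if not window:
        return {}
    counts = Counter(entity for entities in window for entity in set(entities))
    return {entity: n / len(window) for entity, n in counts.items()}


extracted = [
    ["Narendra Modi", "BJP"],
    ["Narendra Modi"],
    ["India"],
    ["Narendra Modi", "BJP"],
]
vector = hashtag_entity_vector(extracted, k=4)
```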
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Dynamically Updated
Background Knowledge
δ
27
Event Relevant Background Knowledge
o Wikipedia Event Pages
28
Event Relevant Background Knowledge
o Wikipedia Event Pages
29
Event Relevant Background Knowledge
o Entities mentioned on the Event page of Wikipedia are relevant to the Event
30
Event Relevant Background Knowledge – Graph Structure
o Wikipedia’s Hyperlink structure is very rich
o Page-Page (Wikipedia) links
[Figure: graph of Wikipedia pages: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress]
31
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
One hop from Event
Page
δ
32
Event Relevant Background Knowledge
o Hyperlink structure is dynamically updated
[Figure: graph of Wikipedia pages: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress. Link date shown: 10 May 2010]
33
Event Relevant Background Knowledge
o Hyperlink structure is dynamically updated
[Figure: graph of Wikipedia pages: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress. Link dates shown: 10 May 2010; 29 March 2013 (four links)]
34
Event Relevant Background Knowledge
o Hyperlink structure is dynamically updated
[Figure: graph of Wikipedia pages: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress. Link dates shown: 10 May 2010; 29 March 2013 (four links); 20 May 2013 (two links)]
35
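One way to picture a dynamically updated hyperlink structure is as a list of time-stamped links from which a snapshot can be taken for any date. The sketch below is illustrative only: the dates are the ones shown on the slides, but which link carries which date is not recoverable from the transcript, so the pairing in `links` is a made-up example, and `links_as_of` is a hypothetical helper.

```python
from datetime import date


def links_as_of(timestamped_links, when):
    """Return the (source, target) links that existed on or before `when`."""
    return {(src, dst) for src, dst, added in timestamped_links if added <= when}


# Hypothetical assignment of the slide's dates to links, for illustration only.
links = [
    ("Indian General Election, 2014", "Indian National Congress", date(2010, 5, 10)),
    ("Indian General Election, 2014", "BJP", date(2013, 3, 29)),
    ("Indian General Election, 2014", "Narendra Modi", date(2013, 5, 20)),
]
snapshot = links_as_of(links, date(2013, 4, 1))
```

Recomputing the snapshot periodically gives the event's background knowledge as it stood at that moment, which is what the entity scoring on the next slides operates on.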
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
δ
36
Hyperlink Entity Scoring
o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)
[Figure: two example page pairs: Indian General Election, 2014 with Narendra Modi, and Indian General Election, 2014 with Indian General Election, 2009. Labels: Mutually Important; ed(c,E) = 1; ed(c,E) = 2]
37
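The scoring on this slide can be sketched under two assumptions that the transcript suggests but does not spell out: ed(c,E) counts the directions in which c and the event page E link to each other (1 for a one-way link, 2 for a mutual link), and oco(c,E) is the Jaccard similarity of the outgoing link sets Out(c) and Out(E). The page names and the `out_links` data are invented for the example.

```python
def edge_score(c, event, out_links):
    """Assumed ed(c, E): 1 for a one-way link between the pages, 2 for a mutual link."""
    forward = c in out_links.get(event, set())
    backward = event in out_links.get(c, set())
    return int(forward) + int(backward)


def link_overlap(c, event, out_links):
    """Assumed oco(c, E): Jaccard similarity of Out(c) and Out(E)."""
    a = out_links.get(c, set())
    b = out_links.get(event, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(c, event, out_links):
    """Final score r(c, E) = ed(c, E) + oco(c, E), as given on the slide."""
    return edge_score(c, event, out_links) + link_overlap(c, event, out_links)


# Hypothetical outgoing links per page.
out_links = {
    "Election 2014": {"Narendra Modi", "Election 2009", "BJP"},
    "Narendra Modi": {"BJP", "Gujarat"},
    "Election 2009": {"Election 2014", "BJP", "Narendra Modi"},
}
modi_score = relevance("Narendra Modi", "Election 2014", out_links)
e2009_score = relevance("Election 2009", "Election 2014", out_links)
```

The raw scores are not bounded by 1, so a normalization step would be needed to obtain values like those shown on the following slide. That step is not described in the slides and is left out here.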
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
δ
38
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold δ
Latest K (200,500)
Entity Extraction and Scoring:
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Indian General Election,_2014
Extract, Periodically Update Hyperlink structure (One hop from Event Page)
Entity scoring based on relevance to the Event:
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
Similarity Check
Relevance Score: 0.6
39
Similarity Check
o Set Based
  o Jaccard Similarity
  o Considers the entities without the scores
o Vector Based
  o Symmetric
    o Cosine Similarity
  o Asymmetric
    o Subsumption Similarity
40
Intuition behind Asymmetric Similarity
[Figure: two diagrams relating Indian General Election 2014 and Narendra Modi, labeled Symmetric (Penalized) and Asymmetric (Ignored)]
41
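The three kinds of measure can be compared side by side in a short sketch. Jaccard and cosine follow their standard definitions. The slides do not give the formula for subsumption similarity, so `asymmetric_overlap` below is a stand-in chosen only to show the intuition: it normalizes by the hashtag vector alone, so event entities that the hashtag never mentions are ignored rather than penalized. All names and sample vectors are hypothetical.

```python
from math import sqrt


def jaccard(h, e):
    """Set-based: overlap of entity names, ignoring scores."""
    a, b = set(h), set(e)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(h, e):
    """Symmetric, vector-based: standard cosine similarity."""
    dot = sum(w * e[k] for k, w in h.items() if k in e)
    norm_h = sqrt(sum(w * w for w in h.values()))
    norm_e = sqrt(sum(w * w for w in e.values()))
    return dot / (norm_h * norm_e) if norm_h and norm_e else 0.0


def asymmetric_overlap(h, e):
    """Stand-in asymmetric measure (an assumption, not the slides' exact formula)."""
    total = sum(h.values())
    if not total:
        return 0.0
    return sum(w * e[k] for k, w in h.items() if k in e) / total


hashtag_vec = {"a": 1.0, "b": 1.0}
event_vec = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
scores = (
    jaccard(hashtag_vec, event_vec),
    cosine(hashtag_vec, event_vec),
    asymmetric_overlap(hashtag_vec, event_vec),
)
```

In this toy case the hashtag covers only part of the event. The two symmetric measures drop to 0.5 and about 0.71, while the asymmetric stand-in stays at 1.0 because everything the hashtag mentions is relevant to the event.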
Evaluation – Dataset
o 2 events
  o US Presidential Elections (#election2012)
  o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags
43
Evaluation – Strategy
o Ranking Problem
  o Rank the Top 25 hashtags based on the relevancy of tweets to the event
o Experiment with all the similarity metrics
o Manually annotated the tweets of these hashtags as relevant/irrelevant (Gold Standard)
o Ranking Evaluation Metrics
  o Mean Average Precision
  o NDCG
44
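For reference, the two ranking metrics can be computed as in the sketch below. These are the textbook definitions (average precision over a binary-relevance ranking, and NDCG with a log2 discount), not necessarily the exact variants used in the evaluation. The input lists are toy data.

```python
from math import log2


def average_precision(ranked_relevance):
    """Average precision for a ranking given as 1 (relevant) / 0 (irrelevant) flags."""
    hits = 0
    total = 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several rankings (e.g. one per event)."""
    return sum(average_precision(r) for r in rankings) / len(rankings) if rankings else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / log2(position + 1) for position, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal else 0.0


ap = average_precision([1, 0, 1])
ndcg = ndcg_at_k([1, 0, 1], 3)
```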
Evaluation
45
Evaluation
Evaluated tweets containing the top relevant hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average Precision
46
Conclusions
• Semantic Technologies for Real-time filtering of Social Data
– Wikipedia as a Dynamic Knowledge base for events
– Determining relevant hashtags using Asymmetric similarity measure
– More hashtags in turn increase the coverage of Tweets for events
• Hashtag Analysis
– Co-occurrence technique can be used to detect event relevant hashtags
– More popular hashtags are easier to detect via co-occurrence
47
Thanks
Contact: @pavankaps
pavan@knoesis.org


Editor's Notes

  • #3: Streams are everywhere. I am sure by now manualle, Dr. Sheth would have convinced you about this.
  • #4: In this presentation, we will focus on the social data streams
  • #8: So the journalist needs to track the debate, and he opts for a keyphrase such as “healthcare reform”. What would happen if we include semantics?
  • #14: These examples such as epidemics, natural disasters, political events, and civil unrest are dynamic events.
  • #15: They are continuously evolving and involve many other entities. For example, during the Indian elections, Modi, Rahul Gandhi, Congress, and BJP were related to the event. In many cases this entity-event relevance changes over time. For example, in the case of Hurricane Sandy, NYC was a part of it for 2 days.
  • #16: A naïve approach to get tweets relevant to an event is to use keywords as filters such as these. Twitter’s streaming API allows up to 400 keywords (unpaid) as filters.
  • #17: However we need to update these keywords as and when the topics evolve. This technique is very tedious and not feasible.
  • #18: We focus on this problem where we need to automatically update the filters
  • #19: We focus on this problem where we need to automatically update the filters
  • #20: The number of tweets collected for the Colorado Shooting (CS) is around 122,000, whereas for Occupy Wall Street (OWS) we collected approximately 6M tweets. The total numbers of distinct hashtags found in these tweets are 12K and 191K. To retrieve all the tweets of the event we need 7K hashtags for CS and 21K hashtags for OWS. In other words, if we had to automatically update all the filters, which are hashtags, we would need 7K hashtags to get all the tweets of the Colorado Shooting and 21K hashtags for OWS. It is important to note that we are just crawling for tweets with hashtags for now.
  • #21: And the top 1% of these hashtags retrieves around 85% of the tweets. So, practically speaking, in order to retrieve the tweets we need an approach that automatically reaches these top event-related hashtags on the go.
  • #22: This is how the co-occurrence graph of
  • #23: From this analysis we learn that (1) only a small percentage of the hashtags needs to be detected to retrieve most tweets of the event, (2) these hashtags can be reached via co-occurrence, and (3) if we start with a popular event-related hashtag we can reach other popular ones quickly due to its clustering coefficient.
  • #24: If we use co-occurrence as the primary strategy, we will reach a lot of hashtags as filters.
  • #25: Now coming back to the approach. The input is a hashtag that is relevant to the event, this is manually added and we hope that this is a prominent hashtag. Using a threshold delta we get the other relevant hashtag
  • #26: Hence we start with a manually added event-relevant hashtag and periodically determine the dynamic relevance of the top co-occurring hashtag.
  • #27: The latest 500 tweets of the hashtags are considered and entities from these tweets are scored based on its normalized frequency. Hence the hashtag is represented using a vector of entities. These entities just build a semantic context for the hashtag.
  • #28: Next we utilize the wikipedia page of the event and extract all the relevant entities to the event. The relevant entities are the ones present in the Wikipedia page.
  • #29: There are event wikipedia pages.
  • #30: Arab Springs
  • #31: The links to other Wikipedia pages form a rich source of information. These can be the relevant semi-structured knowledge for the event. For example, UPA, NDA, BJP, RG (Rahul Gandhi) and NaMo (Narendra Modi) are the entities mentioned in the event Wikipedia page and are relevant to the event.
  • #32: A graph structure can be created with all the entities on the Topic Wikipedia page. This is a subgraph of Indian General Election 2014, with links between entities of the event. Narendra Modi and Indian National Congress
  • #33: It is also important that the knowledge base has to be dynamically updated based on the changes in the event.
  • #34: Also, it is dynamically updated
  • #35: Since it is crowd sourced pages on Wikipedia are updated in near real time. we are expecting the knowledge to be
  • #37: We score these entities based on their relevance to the event, then compare the entity vector of the event with that of the hashtag.
  • #41: We use three different types of semantic similarity measures: (1) set based, (2) symmetric vector based, (3) asymmetric vector based.
  • #42: Penalize to check how big a subset the hashtag is compared to the event. #modisarkar cannot cover the whole of the Indian elections, whereas the vice versa is possible. Better way to explain this. What we want to penalize
  • #44: To evaluate, we picked two other events. US Presidential Elections and Hurricane Sandy. We picked #election2012 and #sandy as the starting point for the crawl. We retrieved 5000 tweets in real-time and extracted the top 25 co-occurring hashtags. For each of these hashtags we pick the latest 200 tweets for analysis.
  • #45: We rank the top-25 hashtags based on its relevance to the event. Hence, we transformed it to be a ranking problem using all the similarity measures. For evaluation, all the tweets of the 25 hashtags were manually evaluated to be relevant/irrelevant to the event.
  • #46: The subsumption similarity works the best, whereas using just the co-occurrence of hashtags as a ranking mechanism performs considerably worse than using Wikipedia as a knowledge base.
  • #47: Interesting to hear the findings. More about the results.
  • #49: This presentation emphasizes the value of semantics for social data filtering, specifically for the challenges faced during dynamically evolving event analysis.