Harnessing Linked Knowledge Sources for Topic Classification in Social Media

A. Elizabeth Cano, Andrea Varga†, Matthew Rowe‡, Fabio Ciravegna†, and
Yulan He§
Knowledge Media Institute, The Open University, Milton Keynes
† University of Sheffield, Sheffield
‡ Lancaster University, Lancaster
§ Aston University, Birmingham
UK. 2013

INTRODUCTION
Social Media Streams - Risk in violent and criminal activities

INTRODUCTION
Research Questions:
o  Can semantic features help in topic classification (TC)?
o  Which knowledge source (KS) data and KS taxonomies
provide useful information for improving the TC of tweets?

OUTLINE
• Introduction
- Topic Classification (TC) of Microposts
- Related Work
- State of the art limitations
• Proposed Approach
• Experiments
• Findings
• Conclusions

INTRODUCTION
u  Difficulties of Topic Classification of microposts
o  Restricted number of characters
o  Irregular and ill-formed words
•  Mixing upper and lowercase letter
§  Makes it difficult to detect proper nouns, and other part of
speech tags.
•  Wide variety of language
§  E.g., “see u soon”
o  Event-dependent emerging jargon
• Volatile jargon relevant to particular events
§  E.g., “Jan.25” (used during the Egyptian revolution
o  High Topical Diversity
o  Sparse data

INTRODUCTION
Social Knowledge Sources (KS)
              DBpedia*        Yago2           Freebase
Resources     2.35 million    447 million     3.6 million
Classes       359             562,312         1,450
Properties    1,820           253,213,842     7,000
*Using the DBpedia ontology
o  Structured Semantic Web representation of data
•  Maintained by thousands of editors
§  E.g. DBpedia, derived from Wikipedia
§  Freebase
•  Evolves and adapts as knowledge changes [Syed et al.,
2008]
o  Cover a broad range of topics
o  Characterise topics with a large number of resources

INTRODUCTION
Local and External Metadata of a Tweet

INTRODUCTION
NER:Country  NER:Person
NER:Person
<http://dbpedia.org/resource/Barack_Obama>
<http://dbpedia.org/resource/Egypt>
<http://dbpedia.org/resource/Hosni_Mubarak>

PROPOSED APPROACH
o  State of the art limitations
§  Use of single knowledge sources
§  Entities’ metadata is constrained by the NER service used
(e.g. OpenCalais, Alchemy).
o  Our approach
§  Exploits multiple knowledge sources.
§  Enhances the entity metadata by deriving semantic graphs.
§  Leverages the graph structures surrounding entities present
in a KS for the TC task.
Exploiting Knowledge Sources for the Topic Classification of
Microposts

OUTLINE
• Introduction
• Semantic Meta-graphs
• Weighting Schemas
• Enhancing TC with Semantic Features
• Experiments
• Findings
• Conclusions

PROPOSED APPROACH
Rationale…
[Figure: examples labelled 1 and 2]
Could be more indicative of War and Conflict

PROPOSED APPROACH
Rationale…
Example 2: not necessarily a good indicator of War and Conflict

PROPOSED APPROACH
Rationale…
[Figure: examples labelled 1 and 2]
Can the graph structure of existing knowledge sources provide
an abstraction of the use of these entity types for representing a
topic?

PROPOSED APPROACH
Framework for Topic Classification of Tweets
[Diagram: Retrieve Articles (DB, FB, DB-FB) and Retrieve Tweets (TW); Concept Enrichment; Annotate Tweets; Derive Semantic Features; Build Cross-Source Topic Classifier]
1  Datasets Collection
SPARQL query for all resources from a
given Topic (e.g. War)

PROPOSED APPROACH
2  Datasets Enrichment
From tweets and articles’ abstracts, extract
entities and link them to resources in
DBpedia and Freebase.

PROPOSED APPROACH
3  Semantic Features Derivation

PROPOSED APPROACH
4  Build a Topic Classifier based on Features
Derived from Crossed-Sources

PROPOSED APPROACH
Deriving Semantic Meta-Graphs
<dbpedia:Barack_Obama, rdf:type, yago:PresidentOfTheUnitedStates>
<dbpedia:Barack_Obama, dbo:birthPlace, dbpedia:Hawaii>

PROPOSED APPROACH
Definition 1 - Resource Meta-graph
A resource meta-graph is a tuple $G := (R, P, C, Y)$ where
•  $R$, $P$, $C$ are finite sets whose elements are resources,
properties and classes;
•  $Y$ is a ternary relation $Y \subseteq R \times P \times C$ representing a
hypergraph with ternary edges.
•  The hypergraph of $Y$ is a tripartite graph $H(Y) = \langle V, D \rangle$ where the vertices
are $V = R \cup P \cup C$ and the edges are
$D = \{\{r, p, c\} \mid (r, p, c) \in Y\}$

PROPOSED APPROACH
Resource Meta-graph
The meta-graph of entity e is the aggregation of all resources,
properties and classes related to this entity.
Projecting on Properties: Obama → birthPlace, author, spouse
Projecting on Classes: Obama → LivingPeople, PresidentOfTheUnitedStates, Person, Author
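The two projections can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes triples are plain (subject, predicate, object) tuples and that `rdf:type` marks class membership. The function name `meta_graph` and the toy triples beyond the two shown on the earlier slide are hypothetical.

```python
RDF_TYPE = "rdf:type"

def meta_graph(triples, entity):
    """Collect the properties and classes attached to one entity.

    Assumption: rdf:type triples give the class projection, every other
    predicate contributes to the property projection.
    """
    properties, classes = set(), set()
    for subject, predicate, obj in triples:
        if subject != entity:
            continue
        if predicate == RDF_TYPE:
            classes.add(obj)
        else:
            properties.add(predicate)
    return properties, classes

# Toy triples; only the first and third appear in the slides.
triples = [
    ("dbpedia:Barack_Obama", "rdf:type", "yago:PresidentOfTheUnitedStates"),
    ("dbpedia:Barack_Obama", "rdf:type", "yago:LivingPeople"),
    ("dbpedia:Barack_Obama", "dbo:birthPlace", "dbpedia:Hawaii"),
    ("dbpedia:Barack_Obama", "dbo:spouse", "dbpedia:Michelle_Obama"),
    ("dbpedia:Egypt", "rdf:type", "dbo:Country"),
]

props, classes = meta_graph(triples, "dbpedia:Barack_Obama")
```

Here `props` holds the property projection and `classes` the class projection for the chosen entity; triples about other entities are ignored.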

PROPOSED APPROACH
Resource Meta-graph
How can we weight these graphs to reveal the semantic
features that characterise Obama in the context of
Violence?

PROPOSED APPROACH
Weighting Semantic Features
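Specificity
Measures the relative importance of a property $p \in G(e)$ to
a given class $c \in G(e)$ in a KS graph $G_{KS}$:
$$\mathrm{specificity}_{KS}(p, c) = \frac{N_p(R(c))}{N(R(c))}$$
where $N(R(c))$ is the number of resources of type $c$ and
$N_p(R(c))$ is the number of times $p$ appears in those resources.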

PROPOSED APPROACH
Generality
Captures the specialisation of a property p to a given class c,
by computing the property’s frequency among other
semantically related classes R’(c).
$$\mathrm{generality}_{KS}(p, c) = \frac{N(R'(c))}{N_p(R'(c))}$$
where $N(R'(c))$ is the number of resources whose type is
either c or a specialisation of c’s parent classes.

PROPOSED APPROACH
$$SG(p, c) = \mathrm{specificity}_{KS}(p, c) \times \mathrm{generality}_{KS}(p, c)$$
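A small worked sketch of the three weights on a toy knowledge source may help. It rests on assumptions that are not in the slides: the knowledge source is a dict mapping each resource to its types and properties, R'(c) is taken to be c together with the other children of c's parent class, and N_p counts resources carrying the property. All names and data below are hypothetical.

```python
def resources_of(kb, classes):
    """Resources whose type set intersects the given classes."""
    return [r for r, (types, _) in kb.items() if types & classes]

def count_with_property(kb, resources, prop):
    """Assumed reading of N_p: number of resources that carry prop."""
    return sum(1 for r in resources if prop in kb[r][1])

def specificity(kb, prop, cls):
    rc = resources_of(kb, {cls})
    return count_with_property(kb, rc, prop) / len(rc) if rc else 0.0

def generality(kb, parent_of, prop, cls):
    # Assumed reading of R'(c): c plus the other children of c's parent.
    related = {cls}
    parent = parent_of.get(cls)
    if parent is not None:
        related |= {c for c, par in parent_of.items() if par == parent}
    rc = resources_of(kb, related)
    n_p = count_with_property(kb, rc, prop)
    return len(rc) / n_p if n_p else 0.0

def sg(kb, parent_of, prop, cls):
    return specificity(kb, prop, cls) * generality(kb, parent_of, prop, cls)

# Toy knowledge source: resource -> (types, properties).
kb = {
    "Obama": ({"President"}, {"birthPlace", "spouse"}),
    "Mubarak": ({"President"}, {"birthPlace"}),
    "Merkel": ({"Chancellor"}, {"birthPlace"}),
    "Cameron": ({"PrimeMinister"}, {"party"}),
}
parent_of = {
    "President": "Politician",
    "Chancellor": "Politician",
    "PrimeMinister": "Politician",
}

spouse_weight = sg(kb, parent_of, "spouse", "President")
birth_weight = sg(kb, parent_of, "birthPlace", "President")
```

In this toy data `spouse` is rarer among the sibling classes than `birthPlace`, so it receives the larger SG weight for `President` even though `birthPlace` is more frequent within the class itself.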

PROPOSED APPROACH
Enhancing Feature Space with Semantic Features
Semantic Augmentation (A1)
Class Features: $F'_{A1+C} = F + F_C$
Property Features: $F'_{A1+P} = F + F_P$
Class + Property Features: $F'_{A1+C+P} = F + F_C + F_P$

PROPOSED APPROACH
Semantic Augmentation (A1): example
F: president, obama, televised, statement, hosni, mubarak, resignation,
cnn, says, egypt
$F_{A1+P}$: dbpedia:birth, dbpedia:state, …, dbpedia-owl:PopulatedPlace/populationDensity, …
$F_{A1+C}$: PopulatedPlace, Office_holder, PresidentOfTheUnitedStates,
Politician, …
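One way to read the A1 augmentation is as plain concatenation of feature lists. The sketch below assumes that reading; the function name `augment_a1` and the `class=` / `prop=` prefixes (used only to keep semantic features distinct from words) are illustrative choices, not taken from the slides.

```python
def augment_a1(tokens, classes=(), properties=()):
    """Return F' = F + F_C + F_P as one flat feature list."""
    features = list(tokens)
    features += [f"class={c}" for c in classes]
    features += [f"prop={p}" for p in properties]
    return features

tokens = ["president", "obama", "televised", "statement"]
classes = ["PopulatedPlace", "Office_holder"]
properties = ["dbpedia:birth", "dbpedia:state"]

f_a1_c = augment_a1(tokens, classes=classes)        # class features only
f_a1_p = augment_a1(tokens, properties=properties)  # property features only
f_a1_cp = augment_a1(tokens, classes, properties)   # classes and properties
```

The resulting lists can be fed to any bag-of-features vectoriser; the original lexical features F are always kept.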

PROPOSED APPROACH
Semantic Augmentation with Generalisation (A2)
This augmentation exploits the subsumption relation among
classes within the DBpedia or Freebase ontologies. In this
case we consider the set of parent classes of c.
Parent(c) Features: $F'_{A2+C} = F + F_{parent(c)}$
Parent(c) + Property Features: $F'_{A2+C+P} = F + F_P + F_{parent(c)}$

PROPOSED APPROACH
Semantic Augmentation with Generalisation (A2): example
F: president, obama, televised, statement, hosni, mubarak, resignation,
cnn, says, egypt
$F_{A2+parent(c)}$: Place, Office_holder, President, Politician, …
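The generalisation step can be sketched as a lookup from each class to its parent before concatenation. The `parent_of` mapping below is a hypothetical stand-in for the ontology's subsumption hierarchy, and the `class=` / `prop=` prefixes are an illustrative convention.

```python
def augment_a2(tokens, classes, parent_of, properties=()):
    """Return F' = F + F_P + F_parent(c), replacing classes by their parents."""
    parents = []
    for c in classes:
        parent = parent_of.get(c)
        # Skip classes with no known parent and avoid duplicate parents.
        if parent is not None and parent not in parents:
            parents.append(parent)
    return (
        list(tokens)
        + [f"prop={p}" for p in properties]
        + [f"class={c}" for c in parents]
    )

# Hypothetical fragment of a class hierarchy.
parent_of = {
    "PopulatedPlace": "Place",
    "PresidentOfTheUnitedStates": "President",
}
tokens = ["president", "obama", "egypt"]
classes = ["PopulatedPlace", "PresidentOfTheUnitedStates"]

f_a2_c = augment_a2(tokens, classes, parent_of)
f_a2_cp = augment_a2(tokens, classes, parent_of, properties=["dbpedia:birth"])
```

Mapping to parents trades some precision for coverage: different specific classes that share a parent end up contributing the same feature.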

OUTLINE
• Introduction
• Experiments
• Dataset
• Baseline Features
• Results
• Findings
• Conclusions

PROPOSED APPROACH
Datasets
o  Twitter Dataset [Abel et al., 2011] (TW)
§  Collected over two months starting in November 2010.
§  Topically annotated.
§  Uses tweets labelled as “War & Conflict” (War),
“Law & Crime” (Cri), “Disaster &
Accident” (DisAcc).
§  Multi-labelled dataset comprising 10,189 tweets.
o  DBpedia (DB) and Freebase (FB) Dataset
§  Queried the SPARQL endpoints for all resources from the
categories and subcategories of skos:concept of War,
Cri, DisAcc.
•  DBpedia – 9,465 articles
•  Freebase – 16,915 articles
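The slides do not show the query itself. As a rough sketch of what a category-based retrieval against DBpedia could look like, the helper below builds a SPARQL 1.1 query string that follows `skos:broader` links from a root category and collects resources tagged with `dct:subject`. The helper name `category_query`, the `limit` parameter and the example category URI are assumptions; no endpoint is contacted.

```python
def category_query(category_uri, limit=100):
    """Build a SPARQL query for resources under a category and its subcategories."""
    return f"""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?resource WHERE {{
  ?category skos:broader* <{category_uri}> .
  ?resource dct:subject ?category .
}} LIMIT {limit}
""".strip()

# Hypothetical root category for the War topic.
query = category_query("http://dbpedia.org/resource/Category:War", limit=10)
```

The `skos:broader*` property path matches the root category itself as well as any depth of subcategory, so in practice a depth bound or a result limit is usually needed.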

PROPOSED APPROACH
Experimental Setup A
1.  Use annotated Tweets for training (TW)
-  Baseline: Bag of Words (BoW), Bag of Entities (BoE),
and Part of Speech tags (PoS).
-  Enhance Features using the DBpedia and Freebase
graphs.
2.  Train an SVM classifier on the TW corpus, using an
80%-20% train/test split over five independent runs.
3.  Compute Precision, Recall, and F-measure.
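The evaluation protocol above can be sketched without any particular classifier. The code assumes a random shuffle per run (the slides do not say how the splits were drawn) and binary gold/predicted labels for a single topic; `split_80_20` and `precision_recall_f1` are illustrative names.

```python
import random

def split_80_20(items, seed):
    """Shuffle with a fixed seed and return (train, test) at an 80/20 cut."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def precision_recall_f1(gold, predicted):
    """Precision, recall and F-measure for one topic with binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Five independent runs, each with its own seed.
runs = [split_80_20(range(10), seed) for seed in range(5)]

# Toy labels: 1 means the tweet belongs to the topic.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
p, r, f1 = precision_recall_f1(gold, predicted)
```

Scores from the five runs would typically be averaged; for a multi-label dataset the same function can be applied per topic.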

PROPOSED APPROACH
Results for TW dataset

PROPOSED APPROACH
Experimental Setup B
1.  Use labelled articles from DBpedia (DB) and Freebase
(FB) for training
-  Baseline: Bag of Words (BoW), Bag of Entities (BoE),
and Part of Speech tags (PoS).
-  Enhance Features using the DBpedia and Freebase
graphs.
2.  Train an SVM classifier on the DB, FB, DB+FB and
DB+FB+TW training corpora and test on TW, using an
80%-20% train/test split over five independent runs.
3.  Compute Precision, Recall, and F-measure.

PROPOSED APPROACH
Results for Training on KS articles, and Testing on TW

PROPOSED APPROACH
Factors contributing to the performance of a KS graph for TC
1.  Topic-Class Entropy
2.  Entity-Class Entropy
3.  Topic-Class-Property Entropy
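The slides name these metrics without giving formulas. Assuming they are Shannon entropies over the distribution of classes observed for a topic (or for an entity), a minimal sketch looks like this; the function name `entropy` and the toy class lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Classes of the entities seen in two hypothetical topics.
focused_topic = ["MilitaryConflict"] * 4
mixed_topic = ["MilitaryConflict", "MilitaryConflict", "Person", "Person"]

h_focused = entropy(focused_topic)
h_mixed = entropy(mixed_topic)
```

Under this assumption a topic whose entities spread evenly over many classes has high entropy, while a topic dominated by a single class has entropy close to zero.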

PROPOSED APPROACH
Correlating Entropy metrics with the performance of the
cross-source TC classifiers.
The correlation indicates that the higher the number of ambiguous
entities in a topic within a KS graph, the lower the
performance of the TC.

FINDINGS
1.  KSs combined with Twitter data provide complementary
information for TC of Tweets, outperforming the KS
approaches and the approach using Tweets only.
2.  A KS’s performance on TC depends on the coverage of
the entities within that KS.
3.  When entities have low coverage in a KS, exploiting the
mapping between corresponding KSs’ ontologies is
beneficial.

CONCLUSIONS
•  Explored the task of topic classification of tweets
•  Exploited information in KSs (e.g. DBpedia, Freebase)
using semantic graphs for concepts and properties
surrounding an entity.
•  Presented the importance of considering graph
structures in KSs for the supervised classification of
tweets, by achieving significant improvement over
various state-of-the-art approaches using both single
KSs and Tweets only.

CONTACT US
A.  Elizabeth Cano
•  http://people.kmi.open.ac.uk/cano/
B.  Andrea Varga
•  http://sites.google.com/site/missandreavarga/
C.  Matthew Rowe
•  http://lancs.ac.uk/staff/rowem/
D.  Fabio Ciravegna
•  http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna
E.  Yulan He
•  http://www1.aston.ac.uk/eas/staff/dr-yulan-he
