Julien Plu, Giuseppe Rizzo, Raphaël Troncy
{firstname.lastname}@eurecom.fr,
@julienplu, @giusepperizzo, @rtroncy
Revealing Entities From Texts
With a Hybrid Approach
On June 21st, I went to Paris to see the Eiffel Tower and
to enjoy the world music day.
§ Goal: link (or disambiguate) entity mentions one can
find in text to their corresponding entries in a
knowledge base (e.g. DBpedia)
db:June_21, db:Paris, db:Eiffel_Tower, db:Fête_de_la_Musique
What is Entity Linking?
2015/10/11 - 3rd NLP & DBpedia International Workshop – Bethlehem, Pennsylvania, USA - 2
§ Extract entities in diverse types of textual documents:
Ø newspaper article, encyclopaedia article,
micropost (tweet, status, photo caption), video subtitle, etc.
Ø deal with grammar-free and short texts that have little context
§ Adapt what can be extracted depending on
guidelines or challenges
Ø #Micropost2014 NEEL challenge: link entities that may belong to:
Person, Location, Organization, Function, Amount, Animal, Event, Product,
Time, and Thing (languages, ethnic groups, nationalities, religions, diseases,
sports and astronomical objects)
Ø OKE2015 challenge: extract and link entities that must belong to:
Person, Location, Organization, and Role
Problems
Research Question
How do we adapt an entity linking system
to solve these problems?
§ Input and output in different formats:
Ø Input: plain text, NIF, Micropost2014 (pruning phase)
Ø Output: NIF, TAC (tsv format), Micropost2014 (tsv format with no offset)
§ Text is classified according to its provenance
§ Text is normalized if necessary
For micropost content, RT symbols (in the case of tweets) and emoticons are removed
[ADEL workflow diagram: input text (microposts, newspaper articles, video subtitles, encyclopaedia articles, ...) → Text Normalization → Entity Extractor → Entity Linking (backed by the index) → Pruning]
ADEL Workflow
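As a rough illustration, here is a minimal sketch of how these stages could be composed; the function names, the emoticon regex, and the stage signatures are our own assumptions, not the actual ADEL code:

```python
import re

def normalize(text: str, provenance: str) -> str:
    # For microposts, drop RT markers and emoticons as described above.
    # (The emoji codepoint range used here is a simplification.)
    if provenance == "micropost":
        text = re.sub(r"\bRT\b", "", text)
        text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)
    return " ".join(text.split())

def run_pipeline(text, provenance, extract, link, prune=None):
    # Compose the four stages; each one is a swappable callable, which is
    # what makes the system adaptable to different guidelines.
    mentions = extract(normalize(text, provenance))
    entities = link(mentions)
    return prune(entities) if prune is not None else entities
```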
§ Multiple extractors can be used:
Ø Possibility to switch on and off an extractor in order to adapt the system
to some guidelines
Ø Extractors can be:
• unsupervised: Dictionary, Hashtag + Mention, Number Extractor
• supervised: Date Extractor, POS Tagger, NER System
§ Overlaps are resolved by choosing the longest extracted mention (see the sketch below)
[Diagram: the extractors (Date Extractor, Number Extractor, POS Tagger (NNP/NNPS), Dictionary, NER System (Stanford), Hashtag + Mention Extractor, ...) feed into Overlap Resolution; e.g. the Date Extractor yields “June 21” and the Number Extractor yields “21”, and the overlap resolves to “June 21”]
Entity Extractor
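A small sketch of this longest-mention overlap resolution; the (start, end, text, extractor) tuple layout is an assumption made for the example:

```python
def resolve_overlaps(mentions):
    # Keep the longest mention among overlapping spans: sort by length
    # (longest first) and greedily keep spans that do not overlap a kept one.
    kept = []
    for m in sorted(mentions, key=lambda m: m[1] - m[0], reverse=True):
        if all(m[1] <= k[0] or m[0] >= k[1] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m[0])

# "June 21" (Date Extractor) overlaps "21" (Number Extractor): keep the longest.
mentions = [(3, 10, "June 21", "date"), (8, 10, "21", "number")]
print(resolve_overlaps(mentions))  # [(3, 10, 'June 21', 'date')]
```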
§ From DBpedia:
Ø PageRank
Ø Title
Ø Redirects, Disambiguation
§ From Wikipedia:
Ø Anchors
Ø Link references
For example, from the EN Wikipedia article about Xabi Alonso:
Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were neighbours on the same street while growing up in [[San Sebastián]] and also lived near each other in [[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to transfer to [[Everton F.C.|Everton]] after he told him how happy he was living in [[Liverpool]].

Resulting index entries (anchor counts):
(Arsenal F.C., 1); (Mikel Arteta, 2); (San Sebastián, 1); (Liverpool, 2); (Everton F.C., 1)
How is the index created?
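The anchor-counting part of the index construction can be sketched as below, assuming raw wikitext as input; the regex only covers plain [[target|label]] links:

```python
import re
from collections import Counter

WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def count_anchors(wikitext: str) -> Counter:
    # Count how often each article is the target of an internal wiki link.
    return Counter(WIKILINK.findall(wikitext))

text = ("Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were "
        "neighbours in [[San Sebastián]] and lived near each other in "
        "[[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to transfer "
        "to [[Everton F.C.|Everton]] while living in [[Liverpool]].")
print(count_anchors(text))
# Counter({'Mikel Arteta': 2, 'Liverpool': 2, 'Arsenal F.C.': 1,
#          'San Sebastián': 1, 'Everton F.C.': 1})
```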
§ Generate candidates from a fuzzy match to the index
§ Filter candidates:
Ø Filter out candidates that are not semantically related
to other entities from the same sentence
§ Score each candidate using a linear formula:
score(cand) = (a * L(m, cand) + b * max(L(m, R(cand))) + c * max(L(m, D(cand)))) * PR(cand)
L stands for the Levenshtein distance, R for the set of redirects, D for the set of disambiguation pages, and PR for PageRank
a, b and c are weights set with a > b > c and a + b + c = 1
[Diagram: mention → Candidate Generation (query against the index) → Candidate Filtering → Scoring]
Entity Linking
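A sketch of the scoring formula in Python. We turn the Levenshtein distance into a [0, 1] similarity so that higher scores are better; that normalization and the concrete weight values are our assumptions, since the slide only constrains a > b > c and a + b + c = 1:

```python
def lev(a: str, b: str) -> int:
    # Plain dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim(a: str, b: str) -> float:
    # Levenshtein turned into a [0, 1] similarity (our normalization choice).
    return 1.0 - lev(a, b) / max(len(a), len(b), 1)

def score(m, cand, redirects, disamb, pagerank, a=0.5, b=0.3, c=0.2):
    # score(cand) = (a*L(m,cand) + b*max L(m,R(cand)) + c*max L(m,D(cand))) * PR(cand)
    r = max((sim(m, x) for x in redirects), default=0.0)
    d = max((sim(m, x) for x in disamb), default=0.0)
    return (a * sim(m, cand) + b * r + c * d) * pagerank

# e.g. the db:Paris candidate from the next slide (the PageRank value is made up):
print(score("Paris", "Paris", {"Parisien", "Paname"},
            {"Paris (disambiguation)"}, pagerank=0.82))
```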
Sentence: I went to Paris to see the Eiffel Tower.
§ Generate Candidates:
Ø Paris: db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
Ø Eiffel Tower: db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)
§ Filter candidates:
Ø db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
Ø db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)
§ Scoring:
Ø Score(db:Paris)= (a * L(“Paris”, “Paris”) + b * max(L(“Paris”, R(“Parisien”, “Paname”))) + c *
max(L(“Paris”, D(“Paris (disambiguation)”)))) * PR(db:Paris)
Ø Score(db:Notre_Dame_de_Paris)= (a * L(“Paris”, “Notre Dame de Paris”) + b * max(L(“Paris”, R(“Nôtre
Dame”, “Paris Cathedral”))) + c * max(L(“Paris”, D(“Notre Dame”, “Notre Dame de Paris
(disambiguation)”)))) * PR(db:Notre_Dame_de_Paris)
Entity Linking example
§ k-NN machine learning algorithm training process:
Ø Run the system on a training set
Ø Classify entities as true/false according to the training set Gold Standard
Ø Create a file with the features of each entity and its true/false label
Ø Train k-NN with the previous file to get a model
§ Use 10 features for the training:
• Length in number of characters
• Extracted mention
• Title
• Type
• PageRank
• HITS
• Number of inLinks
• Number of outLinks
• Redirects number
• Linking score
[Diagram: Training set → ADEL → Create file of features → Train k-NN]
Pruning
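A sketch of this training step with scikit-learn's KNeighborsClassifier; the two feature rows are toy values, and string-valued features (mention, title, type) are assumed to be encoded as numeric ids:

```python
from sklearn.neighbors import KNeighborsClassifier

# One row per extracted entity, one column per feature: length, mention id,
# title id, type id, PageRank, HITS, in-links, out-links, redirects, linking score.
X = [[5, 17, 17, 2, 0.82, 0.40, 1200, 340, 12, 0.91],
     [2, 42, 43, 0, 0.01, 0.00, 3, 1, 0, 0.12]]
y = [True, False]  # does the entity match the gold standard?

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

def prune(entities, features):
    # Keep only the entities the trained model classifies as true.
    return [e for e, keep in zip(entities, model.predict(features)) if keep]
```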
§ Tweets dataset
Ø Training set: 2340 tweets
Ø Test set: 1165 tweets
§ Link entities that may belong to one of these ten
types:
Ø Person, Location, Organization, Function, Amount, Animal, Event,
Product, Time, and Thing (languages, ethnic groups, nationalities,
religions, diseases, sports and astronomical objects)
§ Twitter user name dereferencing
§ Disambiguate in DBpedia 3.9
#Micropost2014 NEEL challenge
Results on #Micropost2014
§ Results of ADEL with and without pruning
§ Results of other systems
Without pruning:
             Precision   Recall   F-measure
Extraction       69.17    72.51       70.80
Linking          47.39    45.23       46.29

With pruning:
             Precision   Recall   F-measure
Extraction       70.00    41.62       52.20
Linking          48.21    26.74       34.40
F-measure: E2E 70.06, UTwente 54.93, DataTXT 49.90, ADEL 46.29, AIDA 45.37, Hyderabad 45.23, SAP 39.02
§ Sentences from Wikipedia
Ø Training set: 96 sentences
Ø Test set: 101 sentences
§ Extract and link entities that must belong to one of
these four types:
Ø Person, Location, Organization and Role
§ Must disambiguate co-references
§ Allow emerging entities (NIL)
§ Disambiguate in DBpedia 3.9
OKE2015 challenge
Results on OKE2015
§ Results of ADEL with and without pruning
§ Results of other systems
https://github.com/anuzzolese/oke-challenge
Without pruning:
              Precision   Recall   F-measure
Extraction         78.2     65.4        71.2
Recognition        65.8     54.8        59.8
Linking            49.4     46.6        48.0

With pruning:
              Precision   Recall   F-measure
Extraction         83.8      9.3        16.8
Recognition        75.7      8.4        15.1
Linking            57.9      6.2        11.1

F-measure: ADEL 60.75, FOX 49.88, FRED 34.73
#Micropost2015 NEEL challenge
§ Tweets dataset:
Ø Training set: 3498
Ø Development set: 500
Ø Test set: 2027
§ Extract and link entities that must belong to one of
these seven types:
Ø Person, Location, Organization, Character, Event, Product, and Thing
(languages, ethnic groups, nationalities, religions, diseases, sports and
astronomical objects)
§ Twitter user name dereferencing
§ Disambiguate in DBpedia 3.9 + NIL
Results on #Micropost2015
§ Results of ADEL without pruning
§ Results of other systems
Ø Strong type mention match
Ø Strong link match (not considering the type correctness)
              Precision   Recall   F-measure
Extraction         68.4     75.2        71.6
Recognition        62.8     45.5        52.8
Linking            48.8     47.1        47.9

Strong type mention match (F-measure): ousia 80.7, ADEL 52.8, uva 41.2, acubelab 38.8, uniba 36.7, ualberta 32.9, cen_neel 0
Strong link match (F-measure): ousia 76.2, acubelab 52.3, ADEL 47.9, uniba 46.4, ualberta 41.5, uva 31.6, cen_neel 0
Error Analysis
§ Issue for the extraction:
Ø “FB is a prime number.”
• “FB” here denotes 251 in hexadecimal, but it is extracted as the Facebook acronym by the wrong extractor
§ Issue for the filtering:
Ø “The series of HP books has been sold a million times in France.”
• There is no relation in Wikipedia between Harry Potter and France, so no filtering is applied.
§ Issue for the scoring:
Ø “The Spanish football player Alonso played twice for the national team
between 1954 and 1960.”
• Xabi Alonso is selected instead of Juan Alonso because of its higher PageRank.
§ Our system makes it possible to adapt the entity linking task to different kinds of text
§ Our system makes it possible to adapt the types of extracted entities
§ Results are similar regardless of the kind of text
§ Performance at extraction stage similar to top
state-of-the-art systems (or slightly better)
§ The big drop in performance at the linking stage is mainly due to the unsupervised approach
Conclusion
§ Add more adaptive features: language, knowledge base
§ Improve linking by using a graph-based algorithm:
Ø finding the common entities linked to each of the extracted entities
Ø example: “Rafael Nadal is a friend of Alonso”. There is no direct link between Rafael Nadal and Alonso in DBpedia (or Wikipedia), but they have the entity Spain in common (see the sketch after this slide)
§ Improve pruning by:
Ø adding additional features:
• relatedness: compute the relation score between one entity and all the others in the text; if there are more than two, compute the average
• POS tag of the previous and the next token in the sentence
Ø using other algorithms:
• Ensemble Learning
• Unsupervised Feature Learning + Deep Learning
Future Work
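A toy sketch of the common-entity idea from the future work above, on a 1-hop link graph; a real implementation would query DBpedia or Wikipedia page links instead of this hand-built dict:

```python
def common_entities(graph, e1, e2):
    # Entities directly linked from both e1 and e2 (1-hop neighbourhood intersection).
    return graph.get(e1, set()) & graph.get(e2, set())

# Hand-built toy graph; entity names follow the DBpedia resource style.
graph = {
    "Rafael_Nadal": {"Spain", "Tennis"},
    "Xabi_Alonso":  {"Spain", "Liverpool_F.C."},
}
print(common_entities(graph, "Rafael_Nadal", "Xabi_Alonso"))  # {'Spain'}
```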
http://www.slideshare.net/julienplu
http://xkcd.com/1319/