Julien Plu, Giuseppe Rizzo, Raphaël Troncy
{firstname.lastname}@eurecom.fr,
@julienplu, @giusepperizzo, @rtroncy
Revealing Entities From Texts
With a Hybrid Approach
On June 21st, I went to Paris to see the Eiffel Tower and
to enjoy the world music day.
§ Goal: link (or disambiguate) entity mentions one can
find in text to their corresponding entries in a
knowledge base (e.g. DBpedia)
db:June_21, db:Paris, db:Eiffel_Tower, db:Fête_de_la_Musique
What is Entity Linking?
2015/10/11 - 3rd NLP & DBpedia International Workshop – Bethlehem, Pennsylvania, USA - 2
§ Extract entities in diverse types of textual documents:
Ø newspaper article, encyclopaedia article,
micropost (tweet, status, photo caption), video subtitle, etc.
Ø deal with grammar-free and short texts that have little context
§ Adapt what can be extracted depending on
guidelines or challenges
Ø #Micropost2014 NEEL challenge: link entities that may belong to:
Person, Location, Organization, Function, Amount, Animal, Event, Product,
Time, and Thing (languages, ethnic groups, nationalities, religions, diseases,
sports and astronomical objects)
Ø OKE2015 challenge: extract and link entities that must belong to:
Person, Location, Organization, and Role
Problems
Research Question
How do we adapt an entity linking system
to solve these problems?
§ Input and output in different formats:
Ø Input: plain text, NIF, Micropost2014 (pruning phase)
Ø Output: NIF, TAC (tsv format), Micropost2014 (tsv format with no offset)
§ Text is classified according to its provenance
§ Text is normalized if necessary
For micropost content, RT symbols (in the case of tweets) and emoticons are removed
[ADEL workflow diagram: input text (microposts, newspaper articles, video subtitles, encyclopaedia articles, ...) → Text Normalization → Entity Extractor → Entity Linking (backed by the index) → Pruning]
ADEL Workflow
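As a rough illustration, here is a minimal sketch of how these stages could be composed; the function names, the emoticon regex, and the stage signatures are our own assumptions, not the actual ADEL code:

```python
import re

def normalize(text: str, provenance: str) -> str:
    # For microposts, drop RT markers and emoticons as described above.
    # (The emoji codepoint range used here is a simplification.)
    if provenance == "micropost":
        text = re.sub(r"\bRT\b", "", text)
        text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)
    return " ".join(text.split())

def run_pipeline(text, provenance, extract, link, prune=None):
    # Compose the four stages; each one is a swappable callable, which is
    # what makes the system adaptable to different guidelines.
    mentions = extract(normalize(text, provenance))
    entities = link(mentions)
    return prune(entities) if prune is not None else entities
```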
§ Multiple extractors can be used:
Ø Possibility to switch on and off an extractor in order to adapt the system
to some guidelines
Ø Extractors can be:
• unsupervised: Dictionary, Hashtag + Mention, Number Extractor
• supervised: Date Extractor, POS Tagger, NER System
§ Overlaps are resolved by choosing the longest extracted mention (see the sketch below)
[Diagram: the extractors (Date Extractor, Number Extractor, POS Tagger (NNP/NNPS), Dictionary, NER System (Stanford), Hashtag + Mention Extractor, ...) feed into Overlap Resolution; e.g. the Date Extractor yields “June 21” and the Number Extractor yields “21”, and the overlap resolves to “June 21”]
Entity Extractor
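A small sketch of this longest-mention overlap resolution; the (start, end, text, extractor) tuple layout is an assumption made for the example:

```python
def resolve_overlaps(mentions):
    # Keep the longest mention among overlapping spans: sort by length
    # (longest first) and greedily keep spans that do not overlap a kept one.
    kept = []
    for m in sorted(mentions, key=lambda m: m[1] - m[0], reverse=True):
        if all(m[1] <= k[0] or m[0] >= k[1] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m[0])

# "June 21" (Date Extractor) overlaps "21" (Number Extractor): keep the longest.
mentions = [(3, 10, "June 21", "date"), (8, 10, "21", "number")]
print(resolve_overlaps(mentions))  # [(3, 10, 'June 21', 'date')]
```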
§ From DBpedia:
Ø PageRank
Ø Title
Ø Redirects, Disambiguation
§ From Wikipedia:
Ø Anchors
Ø Link references
For example, from the EN Wikipedia article about Xabi Alonso:
Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were neighbours on the same street while growing up in [[San Sebastián]] and also lived near each other in [[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to transfer to [[Everton F.C.|Everton]] after he told him how happy he was living in [[Liverpool]].

Resulting index entries (anchor counts):
(Arsenal F.C., 1); (Mikel Arteta, 2); (San Sebastián, 1); (Liverpool, 2); (Everton F.C., 1)
How is the index created?
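The anchor-counting part of the index construction can be sketched as below, assuming raw wikitext as input; the regex only covers plain [[target|label]] links:

```python
import re
from collections import Counter

WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def count_anchors(wikitext: str) -> Counter:
    # Count how often each article is the target of an internal wiki link.
    return Counter(WIKILINK.findall(wikitext))

text = ("Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were "
        "neighbours in [[San Sebastián]] and lived near each other in "
        "[[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to transfer "
        "to [[Everton F.C.|Everton]] while living in [[Liverpool]].")
print(count_anchors(text))
# Counter({'Mikel Arteta': 2, 'Liverpool': 2, 'Arsenal F.C.': 1,
#          'San Sebastián': 1, 'Everton F.C.': 1})
```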
§ Generate candidates from a fuzzy match to the index
§ Filter candidates:
Ø Filter out candidates that are not semantically related
to other entities from the same sentence
§ Score each candidate using a linear formula:
score(cand) = (a * L(m, cand) + b * max(L(m, R(cand))) + c * max(L(m, D(cand)))) * PR(cand)
L stands for the Levenshtein distance, R for the set of redirects, D for the set of disambiguation pages, and PR for PageRank
a, b and c are weights set with a > b > c and a + b + c = 1
[Diagram: mention → Candidate Generation (query against the index) → Candidate Filtering → Scoring]
Entity Linking
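A sketch of the scoring formula in Python. We turn the Levenshtein distance into a [0, 1] similarity so that higher scores are better; that normalization and the concrete weight values are our assumptions, since the slide only constrains a > b > c and a + b + c = 1:

```python
def lev(a: str, b: str) -> int:
    # Plain dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim(a: str, b: str) -> float:
    # Levenshtein turned into a [0, 1] similarity (our normalization choice).
    return 1.0 - lev(a, b) / max(len(a), len(b), 1)

def score(m, cand, redirects, disamb, pagerank, a=0.5, b=0.3, c=0.2):
    # score(cand) = (a*L(m,cand) + b*max L(m,R(cand)) + c*max L(m,D(cand))) * PR(cand)
    r = max((sim(m, x) for x in redirects), default=0.0)
    d = max((sim(m, x) for x in disamb), default=0.0)
    return (a * sim(m, cand) + b * r + c * d) * pagerank

# e.g. the db:Paris candidate from the next slide (the PageRank value is made up):
print(score("Paris", "Paris", {"Parisien", "Paname"},
            {"Paris (disambiguation)"}, pagerank=0.82))
```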
Sentence: I went to Paris to see the Eiffel Tower.
§ Generate Candidates:
Ø Paris: db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
Ø Eiffel Tower: db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)
§ Filter candidates:
Ø db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
Ø db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)
§ Scoring:
Ø Score(db:Paris)= (a * L(“Paris”, “Paris”) + b * max(L(“Paris”, R(“Parisien”, “Paname”))) + c *
max(L(“Paris”, D(“Paris (disambiguation)”)))) * PR(db:Paris)
Ø Score(db:Notre_Dame_de_Paris)= (a * L(“Paris”, “Notre Dame de Paris”) + b * max(L(“Paris”, R(“Nôtre
Dame”, “Paris Cathedral”))) + c * max(L(“Paris”, D(“Notre Dame”, “Notre Dame de Paris
(disambiguation)”)))) * PR(db:Notre_Dame_de_Paris)
Entity Linking example
§ k-NN machine learning algorithm training process:
Ø Run the system on a training set
Ø Classify entities as true/false according to the training set Gold Standard
Ø Create a file with the features of each entity and its true/false label
Ø Train k-NN with the previous file to get a model
§ Use 10 features for the training:
• Length in number of characters
• Extracted mention
• Title
• Type
• PageRank
• HITS
• Number of inLinks
• Number of outLinks
• Redirects number
• Linking score
[Diagram: Training set → ADEL → Create file of features → Train k-NN]
Pruning
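A sketch of this training step with scikit-learn's KNeighborsClassifier; the two feature rows are toy values, and string-valued features (mention, title, type) are assumed to be encoded as numeric ids:

```python
from sklearn.neighbors import KNeighborsClassifier

# One row per extracted entity, one column per feature: length, mention id,
# title id, type id, PageRank, HITS, in-links, out-links, redirects, linking score.
X = [[5, 17, 17, 2, 0.82, 0.40, 1200, 340, 12, 0.91],
     [2, 42, 43, 0, 0.01, 0.00, 3, 1, 0, 0.12]]
y = [True, False]  # does the entity match the gold standard?

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

def prune(entities, features):
    # Keep only the entities the trained model classifies as true.
    return [e for e, keep in zip(entities, model.predict(features)) if keep]
```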
§ Tweets dataset
Ø Training set: 2340 tweets
Ø Test set: 1165 tweets
§ Link entities that may belong to one of these ten
types:
Ø Person, Location, Organization, Function, Amount, Animal, Event,
Product, Time, and Thing (languages, ethnic groups, nationalities,
religions, diseases, sports and astronomical objects)
§ Twitter user name dereferencing
§ Disambiguate in DBpedia 3.9
#Micropost2014 NEEL challenge
Results on #Micropost2014
§ Results of ADEL with and without pruning
§ Results of other systems
Without pruning:
             Precision   Recall   F-measure
Extraction       69.17    72.51       70.80
Linking          47.39    45.23       46.29

With pruning:
             Precision   Recall   F-measure
Extraction       70.00    41.62       52.20
Linking          48.21    26.74       34.40
F-measure: E2E 70.06, UTwente 54.93, DataTXT 49.90, ADEL 46.29, AIDA 45.37, Hyderabad 45.23, SAP 39.02
§ Sentences from Wikipedia
Ø Training set: 96 sentences
Ø Test set: 101 sentences
§ Extract and link entities that must belong to one of
these four types:
Ø Person, Location, Organization and Role
§ Must disambiguate co-references
§ Allow emerging entities (NIL)
§ Disambiguate in DBpedia 3.9
OKE2015 challenge
Results on OKE2015
§ Results of ADEL with and without pruning
§ Results of other systems
https://github.com/anuzzolese/oke-challenge
Without pruning:
              Precision   Recall   F-measure
Extraction         78.2     65.4        71.2
Recognition        65.8     54.8        59.8
Linking            49.4     46.6        48.0

With pruning:
              Precision   Recall   F-measure
Extraction         83.8      9.3        16.8
Recognition        75.7      8.4        15.1
Linking            57.9      6.2        11.1

F-measure: ADEL 60.75, FOX 49.88, FRED 34.73
#Micropost2015 NEEL challenge
§ Tweets dataset:
Ø Training set: 3498
Ø Development set: 500
Ø Test set: 2027
§ Extract and link entities that must belong to one of
these seven types:
Ø Person, Location, Organization, Character, Event, Product, and Thing
(languages, ethnic groups, nationalities, religions, diseases, sports and
astronomical objects)
§ Twitter user name dereferencing
§ Disambiguate in DBpedia 3.9 + NIL
Results on #Micropost2015
§ Results of ADEL without pruning
§ Results of other systems
Ø Strong type mention match
Ø Strong link match (not considering the type correctness)
              Precision   Recall   F-measure
Extraction         68.4     75.2        71.6
Recognition        62.8     45.5        52.8
Linking            48.8     47.1        47.9

Strong type mention match (F-measure): ousia 80.7, ADEL 52.8, uva 41.2, acubelab 38.8, uniba 36.7, ualberta 32.9, cen_neel 0
Strong link match (F-measure): ousia 76.2, acubelab 52.3, ADEL 47.9, uniba 46.4, ualberta 41.5, uva 31.6, cen_neel 0
Error Analysis
§ Issue for the extraction:
Ø “FB is a prime number.”
• “FB” here denotes 251 in hexadecimal, but it is extracted as the Facebook acronym by the wrong extractor
§ Issue for the filtering:
Ø “The series of HP books has been sold a million times in France.”
• There is no relation in Wikipedia between Harry Potter and France, so no filtering is applied.
§ Issue for the scoring:
Ø “The Spanish football player Alonso played twice for the national team
between 1954 and 1960.”
• Xabi Alonso is selected instead of Juan Alonso because of its higher PageRank.
§ Our system makes it possible to adapt the entity linking task to different kinds of text
§ Our system makes it possible to adapt the types of extracted entities
§ Results are similar regardless of the kind of text
§ Performance at extraction stage similar to top
state-of-the-art systems (or slightly better)
§ The big drop in performance at the linking stage is mainly due to the unsupervised approach
Conclusion
§ Add more adaptive features: language, knowledge base
§ Improve linking by using a graph-based algorithm:
Ø finding the common entities linked to each of the extracted entities
Ø example: “Rafael Nadal is a friend of Alonso”. There is no direct link between Rafael Nadal and Alonso in DBpedia (or Wikipedia), but they have the entity Spain in common (see the sketch after this slide)
§ Improve pruning by:
Ø adding additional features:
• relatedness: compute the relation score between one entity and all the others in the text; if there are more than two, compute the average
• POS tag of the previous and the next token in the sentence
Ø using other algorithms:
• Ensemble Learning
• Unsupervised Feature Learning + Deep Learning
Future Work
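A toy sketch of the common-entity idea from the future work above, on a 1-hop link graph; a real implementation would query DBpedia or Wikipedia page links instead of this hand-built dict:

```python
def common_entities(graph, e1, e2):
    # Entities directly linked from both e1 and e2 (1-hop neighbourhood intersection).
    return graph.get(e1, set()) & graph.get(e2, set())

# Hand-built toy graph; entity names follow the DBpedia resource style.
graph = {
    "Rafael_Nadal": {"Spain", "Tennis"},
    "Xabi_Alonso":  {"Spain", "Liverpool_F.C."},
}
print(common_entities(graph, "Rafael_Nadal", "Xabi_Alonso"))  # {'Spain'}
```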
http://www.slideshare.net/julienplu
http://xkcd.com/1319/