ENTITY LINKING WITH A KNOWLEDGE BASE: ISSUES,
TECHNIQUES, AND SOLUTIONS
Abstract—The large number of potential applications from bridging web data
with knowledge bases has led to an increase in entity linking research. Entity
linking is the task of linking entity mentions in text with their corresponding entities in
a knowledge base. Potential applications include information extraction,
information retrieval, and knowledge base population. However, this task is
challenging due to name variations and entity ambiguity. In this survey, we present
a thorough overview and analysis of the main approaches to entity linking, and
discuss various applications, the evaluation of entity linking systems, and future
directions.
EXISTING SYSTEM:
The entity linking task is challenging due to name variations and entity ambiguity.
A named entity may have multiple surface forms, such as its full name, partial
names, aliases, abbreviations, and alternate spellings. For example, the named
entity “Cornell University” has the abbreviation “Cornell”, and the named entity
“New York City” has the nickname “Big Apple”. An entity linking system has to
identify the correct mapping entities for entity mentions of various surface forms.
On the other hand, an entity mention could possibly denote different named
entities. For instance, the entity mention “Sun” can refer to the star at the center of
the Solar System, a multinational computer company, a fictional character named
“Sun-Hwa Kwon” on the ABC television series “Lost” or many other entities
which can be referred to as “Sun”. An entity linking system has to disambiguate
the entity mention in the textual context and identify the mapping entity for each
entity mention.
PROPOSED SYSTEM:
We have presented a comprehensive survey for entity linking. Specifically, we
have surveyed the main approaches utilized in the three modules of entity linking
systems (i.e., Candidate Entity Generation, Candidate Entity Ranking, and
Unlinkable Mention Prediction), and also introduced other critical aspects of entity
linking such as applications, features, and evaluation. Although there are so many
methods proposed to deal with entity linking, it is currently unclear which
techniques and systems are the current state-of-the-art, as these systems all differ
along multiple dimensions and are evaluated over different data sets. A single
entity linking system typically performs very differently for different data sets and
domains. Although the supervised ranking methods seem to perform much better
than the unsupervised approaches with respect to candidate entity ranking, the
overall performance of the entity linking system is also significantly influenced by
techniques adopted in the other two modules (i.e., Candidate Entity Generation and
Unlinkable Mention Prediction). Supervised techniques require many annotated
training examples, and the task of annotating examples is costly. Furthermore, the
entity linking task is highly data dependent, and it is unlikely that a single technique
dominates all others across all data sets. For a given entity linking task, it is difficult to
determine which techniques are best suited.
Module 1
Entity Linking System
An entity linking system consists of the following three modules:
• Candidate entity generation. In this module, for each entity mention m ∈ M, the
entity linking system aims to filter out irrelevant entities in the knowledge base and
retrieve a candidate entity set E_m which contains the possible entities that entity
mention m may refer to. To achieve this goal, a variety of techniques have been
utilized by some state-of-the-art entity linking systems, such as name dictionary
based techniques, surface form expansion from the local document, and methods
based on search engines.
• Candidate entity ranking. In most cases, the size of the candidate entity set E_m is
larger than one. Researchers leverage different kinds of evidence to rank the
candidate entities in E_m and try to find the entity e ∈ E_m which is the most likely
link for mention m.
• Unlinkable mention prediction. To deal with the problem of predicting unlinkable
mentions, some work leverages this module to validate whether the top-ranked entity
identified in the Candidate Entity Ranking module is the target entity for mention
m. Otherwise, they return NIL for mention m.
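The way the three modules fit together can be sketched as follows. This is only an illustrative outline, not the design of any particular system: the function name link_mention, the generate and score callables, the toy knowledge base, and the use of a fixed score threshold for returning NIL are all assumptions made for the example.

```python
from typing import Callable, Optional

NIL = None  # returned when no knowledge base entity is judged to match


def link_mention(
    mention: str,
    context: str,
    generate: Callable[[str], set],
    score: Callable[[str, str, str], float],
    nil_threshold: float = 0.5,  # assumed cut-off, chosen for illustration
) -> Optional[str]:
    # Module 1: candidate entity generation
    candidates = generate(mention)
    if not candidates:
        return NIL
    # Module 2: candidate entity ranking (ties broken alphabetically)
    scored = {e: score(mention, context, e) for e in candidates}
    best = max(sorted(scored), key=scored.get)
    # Module 3: unlinkable mention prediction (here: a simple threshold)
    if scored[best] < nil_threshold:
        return NIL
    return best


# Toy knowledge base and helpers (hypothetical, for demonstration only).
toy_kb = {
    "Sun (star)": "star at the center of the solar system",
    "Sun Microsystems": "multinational computer company",
}
toy_names = {"sun": {"Sun (star)", "Sun Microsystems"}}


def toy_generate(mention: str) -> set:
    return toy_names.get(mention.lower(), set())


def toy_score(mention: str, context: str, entity: str) -> float:
    # Fraction of the entity description's words that occur in the context.
    ctx = set(context.lower().split())
    desc = set(toy_kb[entity].lower().split())
    return len(ctx & desc) / len(desc)


print(link_mention("Sun", "the computer company released a new workstation",
                   toy_generate, toy_score))
```

A real system would replace toy_generate and toy_score with the candidate generation and ranking techniques described in the following modules.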
Module 2
Candidate entity generation
In the Candidate Entity Generation module, for each entity mention m ∈ M, entity
linking systems try to include the possible entities that entity mention m may refer
to in the set of candidate entities E_m. Approaches to candidate entity generation are
mainly based on string comparison between the surface form of the entity mention and
the name of the entity existing in a knowledge base. This module is as important as the
Candidate Entity Ranking module and critical for a successful entity linking
system according to the experiments conducted by Hachey et al. [33]. In the
remainder of this section, we review the main approaches that have been applied
for generating the candidate entity set E_m for entity mention m.
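As a rough illustration of candidate generation by string comparison, the sketch below keeps every knowledge base name that either equals the mention or shares all tokens with it in one direction. The matching rules and the function names are assumptions chosen for the example; actual systems use a wider range of comparison rules.

```python
def normalize(name: str) -> str:
    # Lowercase and collapse whitespace so comparisons ignore case and spacing.
    return " ".join(name.lower().split())


def string_match_candidates(mention: str, entity_names: list) -> list:
    """Return entity names that exactly or partially match the mention."""
    m = normalize(mention)
    m_tokens = set(m.split())
    if not m_tokens:
        return []
    matches = []
    for name in entity_names:
        n = normalize(name)
        n_tokens = set(n.split())
        # Exact match, mention contained in the name, or name contained in the mention.
        if m == n or m_tokens <= n_tokens or n_tokens <= m_tokens:
            matches.append(name)
    return matches


names = ["Cornell University", "New York City", "Microsoft"]
print(string_match_candidates("Cornell", names))
```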
Module 3
Name Dictionary Based Techniques
Name dictionary based techniques are the main approaches to candidate entity
generation and are leveraged by many entity linking systems. The structure of
Wikipedia provides a set of useful features for generating candidate entities, such
as entity pages, redirect pages, disambiguation pages, bold phrases from the first
paragraphs, and hyperlinks in Wikipedia articles. These entity linking systems
leverage different combinations of these features to build an offline name
dictionary D between various names and their possible mapping entities, and
exploit this constructed name dictionary D to generate candidate entities. This
name dictionary D contains a vast amount of information about various names of
named entities, like name variations, abbreviations, confusable names, spelling
variations, nicknames, etc. Specifically, the name dictionary D is a ⟨key, value⟩
mapping, where the key column is a list of names. Suppose k is a name in the key
column, and its mapping value k.value in the value column is a set of named
entities which could be referred to as the name k. The dictionary D is constructed
by leveraging features from Wikipedia as follows:
• Entity pages. Each entity page in Wikipedia describes a single entity and
contains the information focusing on this entity. Generally, the title of each page is
the most common name for the entity described in this page, e.g., the page title
“Microsoft” for that giant software company headquartered in Redmond. Thus, the
title of the entity page is added to the key column in D as a name k, and the entity
described in this page is added as k.value.
• Redirect pages. A redirect page exists for each alternative name which could be
used to refer to an existing entity in Wikipedia. For example, the article titled
“Microsoft Corporation”, which is the full name of Microsoft, contains a pointer to
the article of the entity Microsoft. Redirect pages often indicate synonym terms,
abbreviations, or other variations of the pointed entities. Therefore, the title of the
redirect page is added to the key column in D as a name k, and the pointed entity is
added as k.value.
• Disambiguation pages. When multiple entities in Wikipedia could be given the
same name, a disambiguation page is created to separate them and contains a list of
references to those entities. For example, the disambiguation page for the name
“Michael Jordan” lists thirteen associated entities having the same name of
“Michael Jordan” including the famous NBA player and the Berkeley professor.
These disambiguation pages are very useful in extracting abbreviations or other
aliases of entities. For each disambiguation page, the title of this page is added to
the key column in D as a name k, and the entities listed in this page are added as
k.value.
• Bold phrases from the first paragraphs. In general, the first paragraph of a
Wikipedia article is a summary of the whole article. It sometimes contains a few
phrases written in bold. Varma et al. observed that these bold phrases invariably
are nicknames, alias names or full names of the entity described in this page. For
instance, in the first paragraph of the entity page of Hewlett-Packard (HP), there
are two phrases written in bold (i.e., “Hewlett-Packard Company” and “HP”)
which are respectively the full name and the abbreviation for the entity
Hewlett-Packard. Thus, each of the bold phrases in the first paragraph of each
Wikipedia page is added to the key column in D as a name k, and the entity
described in this page is added as k.value.
• Hyperlinks in Wikipedia articles. An article in Wikipedia often contains
hyperlinks which link to the pages of the entities mentioned in this article. The
anchor text of a link pointing to an entity page provides a very useful source of
synonyms and other name variations of the pointed entity, and could be regarded
as a name of that linked entity. For example, in the entity page of Hewlett-Packard,
there is a hyperlink pointing to the entity William Reddington Hewlett whose
anchor text is “Bill Hewlett”, which is an alias name of the entity William
Reddington Hewlett. Hence, the anchor text of the hyperlink is added to the key
column in D as a name k, and the pointed entity is added as k.value.
Using these features from Wikipedia described above, entity linking systems could
construct a dictionary D. Besides leveraging the features from Wikipedia, there are some
studies that exploit query click logs and web documents to find entity synonyms,
which are also helpful for the name dictionary construction.
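The construction of the dictionary D from the five Wikipedia features can be sketched as below. The input layout (lists, dicts and pairs already extracted from Wikipedia), the function name build_name_dictionary, and the entity labels used for the two Michael Jordan entries are assumptions for illustration; extracting these inputs from a real Wikipedia dump is not shown.

```python
from collections import defaultdict


def build_name_dictionary(entity_titles, redirects, disambiguations,
                          bold_phrases, anchors):
    """Build D: each name k maps to the set of entities it may refer to."""
    D = defaultdict(set)
    # Entity pages: the page title is a name of the entity it describes.
    for title in entity_titles:
        D[title].add(title)
    # Redirect pages: the redirect title is a name of the pointed entity.
    for alias, target in redirects.items():
        D[alias].add(target)
    # Disambiguation pages: the page title is a name of every listed entity.
    for name, entities in disambiguations.items():
        D[name].update(entities)
    # Bold phrases in the first paragraph are names of the page's entity.
    for entity, phrases in bold_phrases.items():
        for phrase in phrases:
            D[phrase].add(entity)
    # Hyperlinks: the anchor text is a name of the linked entity.
    for anchor_text, target in anchors:
        D[anchor_text].add(target)
    return dict(D)


# Toy inputs mirroring the examples in the text.
D = build_name_dictionary(
    entity_titles=["Microsoft", "Hewlett-Packard"],
    redirects={"Microsoft Corporation": "Microsoft"},
    disambiguations={"Michael Jordan": ["Michael Jordan (NBA player)",
                                        "Michael Jordan (Berkeley professor)"]},
    bold_phrases={"Hewlett-Packard": ["Hewlett-Packard Company", "HP"]},
    anchors=[("Bill Hewlett", "William Reddington Hewlett")],
)
print(sorted(D["HP"]))
```

Candidate generation for a mention then reduces to a lookup such as D.get(mention, set()), possibly after normalizing the mention string.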
Module 4
Surface Form Expansion from the Local Document
Since some entity mentions are acronyms or parts of their full names, one category
of entity linking systems uses surface form expansion techniques to identify
other possible expanded variations (such as the full name) in the associated
document where the entity mention appears. These systems can then leverage the
expanded forms to generate the candidate entity set using other methods such as
the name dictionary based techniques introduced above. We categorize the surface
form expansion techniques into heuristic based methods and supervised
learning methods.
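One simple heuristic of this kind can be sketched as follows: for an acronym mention, scan the local document for a run of consecutive capitalized words whose initials spell the acronym. This specific rule and the function name expand_acronym are assumptions for illustration; the heuristic and supervised methods used by actual systems are more elaborate.

```python
import re


def expand_acronym(acronym: str, document: str) -> list:
    """Find capitalized word sequences in the document whose initials spell the acronym."""
    letters = acronym.upper()
    n = len(letters)
    if n == 0:
        return []
    words = re.findall(r"[A-Za-z][A-Za-z\-]*", document)
    expansions = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        initials = "".join(w[0] for w in window)
        # Require every word to be capitalized and the initials to match.
        if all(w[0].isupper() for w in window) and initials == letters:
            phrase = " ".join(window)
            if phrase not in expansions:
                expansions.append(phrase)
    return expansions


text = "The American Broadcasting Company (ABC) aired the series Lost."
print(expand_acronym("ABC", text))
```

The expanded form returned here could then be looked up in a name dictionary to generate candidates for the original acronym mention.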
Module 5
Candidate entity ranking
In the previous section, we described methods that could generate the candidate
entity set E_m for each entity mention m. We denote the size of E_m as |E_m|, and
use 1 ≤ i ≤ |E_m| to index the candidate entity in E_m. The candidate entity with
index i in E_m is denoted by e_i. In most cases, the size of the candidate entity set
E_m is larger than one. For instance, Ji et al. [89] showed that the average number
of candidate entities per entity mention on the TAC-KBP2010 data set is 12.9, and
this average number on the TAC-KBP2011 data set is 13.1. In addition, this
average number is 73 on the CoNLL data set utilized in [58]. Therefore, the
remaining problem is how to incorporate different kinds of evidence to rank the
candidate entities in E_m and pick the proper entity from E_m as the mapping entity
for the entity mention m. The Candidate Entity Ranking module is a key
component for the entity linking system. We can broadly divide these candidate
entity ranking methods into two categories:
• Supervised ranking methods. These approaches rely on annotated training data
to “learn” how to rank the candidate entities in E_m. These approaches include
binary classification methods, learning to rank methods, probabilistic methods, and
graph based approaches.
• Unsupervised ranking methods. These approaches are based on an unlabeled corpus
and do not require any manually annotated corpus to train the model. These
approaches include vector space model (VSM) based methods and information
retrieval based methods.
In this section, all candidate entity ranking methods are
illustrated according to the above categorization. In addition, we could also
categorize the candidate entity ranking methods into another three categories:
• Independent ranking methods. These approaches consider that entity mentions
which need to be linked in a document are independent, and do not leverage the
relations between the entity mentions in one document to help candidate entity
ranking. In order to rank the candidate entities, they mainly leverage the context
similarity between the text around the entity mention and the document associated
with the candidate entity.
• Collective ranking methods. These methods assume that a document largely
refers to coherent entities from one or a few related topics, and entity assignments
for entity mentions in one document are interdependent with each other. Thus, in
these methods, entity mentions in one document are collectively linked by
exploiting this “topical coherence”.
• Collaborative ranking methods. For an entity mention that needs to be linked,
these approaches identify other entity mentions having similar surface forms and
similar textual contexts in the other documents. They leverage this cross-document
extended context information obtained from the other similar entity mentions and
the context information of the entity mention itself to rank candidate entities for
the entity mention.
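As an illustration of an unsupervised, independent ranking method, the sketch below scores each candidate by the cosine similarity between a bag-of-words vector of the mention context and a vector of the text associated with the candidate. The raw term-count weighting, the whitespace tokenization, and the names rank_candidates and candidate_docs are simplifying assumptions; VSM based systems typically use richer term weighting.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Raw term counts over lowercase, whitespace-separated tokens.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(mention_context: str, candidate_docs: dict) -> list:
    """Return (entity, score) pairs sorted by decreasing context similarity."""
    ctx = bag_of_words(mention_context)
    scored = [(entity, cosine(ctx, bag_of_words(doc)))
              for entity, doc in candidate_docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


candidates = {
    "Sun (star)": "star at the center of the solar system",
    "Sun Microsystems": "multinational computer company",
}
ranked = rank_candidates(
    "astronomers measured the star at the center of the solar system",
    candidates,
)
print(ranked[0][0])
```

Each mention is scored on its own here, which is what makes the method independent; a collective method would additionally score how well the entities chosen for different mentions in the same document fit together.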
CONCLUSION:
In this paper, we have presented a comprehensive survey for entity linking.
Specifically, we have surveyed the main approaches utilized in the three modules
of entity linking systems (i.e., Candidate Entity Generation, Candidate Entity
Ranking, and Unlinkable Mention Prediction), and also introduced other critical
aspects of entity linking such as applications, features, and evaluation. Although
there are so many methods proposed to deal with entity linking, it is currently
unclear which techniques and systems are the current state-of-the-art, as these
systems all differ along multiple dimensions and are evaluated over different data
sets. A single entity linking system typically performs very differently for different
data sets and domains. Although the supervised ranking methods seem to perform
much better than the unsupervised approaches with respect to candidate entity
ranking, the overall performance of the entity linking system is also significantly
influenced by techniques adopted in the other two modules (i.e., Candidate Entity
Generation and Unlinkable Mention Prediction). Supervised techniques require
many annotated training examples and the task of annotating examples is costly.
Furthermore, the entity linking task is highly data dependent, and it is unlikely that a
single technique dominates all others across all data sets. For a given entity linking task,
it is difficult to determine which techniques are best suited. There are many aspects
that affect the design of the entity linking system, such as the system requirement
and the characteristics of the data sets. Although our survey has presented many
efforts in entity linking, we believe that there are still many opportunities for
substantial improvement in this field. In the following, we point out some
promising research directions in entity linking.
REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives, “DBpedia: A nucleus
for a web of open data,” in Proc. 6th Int. Semantic Web 2nd Asian Conf. Asian
Semantic Web Conf., 2007, pp. 11–15.
[2] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of semantic
knowledge unifying wordnet and wikipedia,” in Proc. 16th Int. Conf. World Wide
Web, 2007, pp. 697–706.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: A
collaboratively created graph database for structuring human knowledge,” in Proc.
ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1247–1250.
[4] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S.
Soderland, D. S. Weld, and A. Yates, “Web-scale information extraction in
knowitall: (preliminary results),” in Proc. 13th Int. Conf. World Wide Web, 2004,
pp. 100–110.
[5] A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka, Jr, and T. M. Mitchell,
“Coupled semi-supervised learning for information extraction,” in Proc. 3rd ACM
Int. Conf. Web Search Data Mining, 2010, pp. 101–110.
[6] W. Wu, H. Li, H. Wang, and K. Q. Zhu, “Probase: A probabilistic taxonomy
for text understanding,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2012,
pp. 481–492.
[7] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Sci. Am., vol.
284, pp. 34–43, 2001.
[8] E. Agichtein and L. Gravano, “Snowball: Extracting relations from large plain-
text collections,” in Proc. ACM Int. Conf. Digital Libraries, 2000, pp. 85–94.
[9] D. Zelenko, C. Aone, and A. Richardella, “Kernel methods for relation
extraction,” J. Mach. Learn. Res., vol. 3, pp. 1083–1106, 2003.
[10] T. Hasegawa, S. Sekine, and R. Grishman, “Discovering relations among
named entities from large corpora,” in Proc. 42nd Ann. Meeting Assoc. Comput.
Linguistics, 2004, pp. 415–422.

  • 1. ENTITY LINKING WITH A KNOWLEDGE BASE: ISSUES, TECHNIQUES, AND SOLUTIONS Abstract—The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions. EXISTING SYSTEM: The entity linking task is challenging due to name variations and entity ambiguity. A named entity may have multiple surface forms, such as its full name, partial names, aliases, abbreviations, and alternate spellings. For example, the named entity of “Cornell University” has its abbreviation “Cornell” and the named entity of “New York City” has its nickname “Big Apple”. An entity linking system has to identify the correct mapping entities for entity mentions of various surface forms.
  • 2. On the other hand, an entity mention could possibly denote different named entities. For instance, the entity mention “Sun” can refer to the star at the center of the Solar System, a multinational computer company, a fictional character named “Sun-Hwa Kwon” on the ABC television series “Lost” or many other entities which can be referred to as “Sun”. An entity linking system has to disambiguate the entity mention in the textual context and identify the mapping entity for each entity mention. PROPOSED SYSTEM: We have presented a comprehensive survey for entity linking. Specifically, we have surveyed the main approaches utilized in the three modules of entity linking systems (i.e., Candidate Entity Generation, Candidate Entity Ranking, and Unlinkable Mention Prediction), and also introduced other critical aspects of entity linking such as applications, features, and evaluation. Although there are so many methods proposed to deal with entity linking, it is currently unclear which techniques and systems are the current state-of-the-art, as these systems all differ along multiple dimensions and are evaluated over different data sets. A single entity linking system typically performs very differently for different data sets and domains. Although the supervised ranking methods seem to perform much better than the unsupervised approaches with respect to candidate entity ranking, the overall performance of the entity linking system is also significantly influenced by
  • 3. techniques adopted in the other two modules (i.e., Candidate Entity Generation and Unlinkable Mention Prediction). Supervised techniques require many annotated training examples and the task of annotating examples is costly. Furthermore, the entity linking task is highly data dependent and it is unlikely a technique dominates all others across all data sets. For a given entity linking task, it is difficult to determine which techniques are best suited. Module 1 Entity lining System Entity linking system consists of the following three modules: _ Candidate entity generation. In this module, for each entity mention m 2 M, the entity linking system aims to filter out irrelevant entities in the knowledge base and retrieve a candidate entity set Em which contains possible entities that entity mention m may refer to. To achieve this goal, a variety of techniques have been utilized by some state-of-the-art entity linking systems, such as name dictionary based techniques, surface form expansion from the local document, and methods based on search engine. _ Candidate entity ranking. In most cases, the size of the candidate entity set Em is larger than one. Researchers leverage different kinds of evidence to rank the candidate entities in Em and try to find the entity e 2 Em which is the most likely link for mention m. To deal with the problem of predicting unlinkable mentions,
  • 4. some work leverages this module to validate whether the topranked entity identified in the Candidate Entity Ranking module is the target entity for mention m. Otherwise, they return NIL for mention m. An overview of the main approaches for predicting unlinkable mentions. Module 2 Candidate entity generation Entity Generation module, for each entity mention m 2 M, entity linking systems try to include possible entities that entity mention m may refer to in the set of candidate entities Em. Approaches to candidate entity generation are mainly based on string comparison between the surface form of the entity mention and the name of the entity existing in a knowledge base. This module is as important as the Candidate Entity Ranking module and critical for a successful entity linking system according to the experiments conducted by Hachey et al. [33]. In the remainder of this section, we review the main approaches that have been applied for generating the candidate entity set Em for entity mention m. Module 3 Name Dictionary BasedTechniques
  • 5. Name dictionary based techniques are the main approaches to candidate entity generation and are leveraged by many entity linking systems.The structure of Wikipedia provides a set of useful features for generating candidate entities, such as entity pages, redirect pages, disambiguation pages, bold phrases from the first paragraphs, and hyperlinks in Wikipedia articles. These entity linking systems leverage different combinations of these features to build an offline name dictionary D between various names and their possible mapping entities, and exploit this constructed name dictionary D to generate candidate entities. This name dictionary D contains vast amount of information about various names of named entities, like name variations, abbreviations, confusable names, spelling variations, nicknames, etc. Specifically, the name dictionary D is a hkey, valuei mapping, where the key column is a list of names. Suppose k is a name in the key column, and its mapping value k:value in the value column is a set of named entities which could be referred to as the name k. The dictionary D is constructed by leveraging features from Wikipedia as follows: _ Entity pages. Each entity page in Wikipedia describes a single entity and contains the information focusing on this entity. Generally, the title of each page is the most common name for the entity described in this page, e.g., the page title “Microsoft” for that giant software company headquartered in Redmond. Thus, the
  • 6. title of the entity page is added to the key column in D as a name k, and the entity described in this page is added as k.value.
  - Redirect pages. A redirect page exists for each alternative name that could be used to refer to an existing entity in Wikipedia. For example, the article titled “Microsoft Corporation”, which is the full name of Microsoft, contains a pointer to the article of the entity Microsoft. Redirect pages often indicate synonyms, abbreviations, or other variations of the pointed entities. Therefore, the title of the redirect page is added to the key column in D as a name k, and the pointed entity is added as k.value.
  - Disambiguation pages. When multiple entities in Wikipedia could be given the same name, a disambiguation page is created to separate them; it contains a list of references to those entities. For example, the disambiguation page for the name “Michael Jordan” lists thirteen associated entities with the same name, including the famous NBA player and the Berkeley professor. These disambiguation pages are very useful for extracting abbreviations or other aliases of entities. For each disambiguation page, the title of the page is added to the key column in D as a name k, and the entities listed in this page are added as k.value.
  - Bold phrases from the first paragraphs. In general, the first paragraph of a Wikipedia article is a summary of the whole article. It sometimes contains a few
  • 7. phrases written in bold. Varma et al. observed that these bold phrases are invariably nicknames, alias names, or full names of the entity described in this page. For instance, in the first paragraph of the entity page of Hewlett-Packard (HP), there are two phrases written in bold (i.e., “Hewlett-Packard Company” and “HP”), which are respectively the full name and the abbreviation of the entity Hewlett-Packard. Thus, each bold phrase in the first paragraph of a Wikipedia page is added to the key column in D as a name k, and the entity described in this page is added as k.value.
  - Hyperlinks in Wikipedia articles. An article in Wikipedia often contains hyperlinks that link to the pages of the entities mentioned in the article. The anchor text of a link pointing to an entity page provides a very useful source of synonyms and other name variations of the pointed entity, and can be regarded as a name of that linked entity. For example, in the entity page of Hewlett-Packard, there is a hyperlink pointing to the entity William Reddington Hewlett whose anchor text is “Bill Hewlett”, which is an alias of the entity William Reddington Hewlett. Hence, the anchor text of the hyperlink is added to the key column in D as a name k, and the pointed entity is added as k.value.
  Using the Wikipedia features described above, entity linking systems can construct the dictionary D. Besides leveraging the features from Wikipedia, there are some
  • 8. studies that exploit query click logs and web documents to find entity synonyms, which are also helpful for name dictionary construction.
Module 4 Surface Form Expansion from the Local Document
  Since some entity mentions are acronyms or parts of their full names, one category of entity linking systems uses surface form expansion techniques to identify other possible expanded variations (such as the full name) in the document where the entity mention appears. These systems can then use the expanded forms to generate the candidate entity set with other methods, such as the name dictionary based techniques introduced above. We categorize the surface form expansion techniques into heuristic based methods and supervised learning methods.
Module 5 Candidate Entity Ranking
  In the previous section, we described methods that generate the candidate entity set E_m for each entity mention m. We denote the size of E_m as |E_m|, and use 1 ≤ i ≤ |E_m| to index the candidate entities in E_m. The candidate entity with index i in E_m is denoted by e_i. In most cases, the size of the candidate entity set
  • 9. E_m is larger than one. For instance, Ji et al. [89] showed that the average number of candidate entities per entity mention on the TAC-KBP2010 data set is 12.9, and that this average is 13.1 on the TAC-KBP2011 data set. In addition, the average is 73 on the CoNLL data set utilized in [58]. Therefore, the remaining problem is how to incorporate different kinds of evidence to rank the candidate entities in E_m and pick the proper entity from E_m as the mapping entity for the entity mention m. The Candidate Entity Ranking module is a key component of the entity linking system. We can broadly divide the candidate entity ranking methods into two categories:
  - Supervised ranking methods. These approaches rely on annotated training data to “learn” how to rank the candidate entities in E_m. They include binary classification methods, learning to rank methods, probabilistic methods, and graph based approaches.
  - Unsupervised ranking methods. These approaches are based on an unlabeled corpus and do not require any manually annotated corpus to train the model. They include vector space model (VSM) based methods and information retrieval based methods.
  In this section, all candidate entity ranking methods are illustrated according to the above categorization. In addition, the candidate entity ranking methods can also be divided into another three categories:
  • 10. - Independent ranking methods. These approaches assume that the entity mentions to be linked in a document are independent, and do not leverage the relations between entity mentions in one document to help candidate entity ranking. To rank the candidate entities, they mainly leverage the context similarity between the text around the entity mention and the document associated with the candidate entity.
  - Collective ranking methods. These methods assume that a document largely refers to coherent entities from one or a few related topics, and that entity assignments for entity mentions in one document are interdependent. Thus, in these methods, entity mentions in one document are linked collectively by exploiting this “topical coherence”.
  - Collaborative ranking methods. For an entity mention that needs to be linked, these approaches identify other entity mentions with similar surface forms and similar textual contexts in other documents. They leverage this cross-document extended context, together with the context of the entity mention itself, to rank candidate entities for the entity mention.
CONCLUSION:
  • 11. In this paper, we have presented a comprehensive survey of entity linking. Specifically, we have surveyed the main approaches utilized in the three modules of entity linking systems (i.e., Candidate Entity Generation, Candidate Entity Ranking, and Unlinkable Mention Prediction), and have also introduced other critical aspects of entity linking such as applications, features, and evaluation. Although many methods have been proposed to deal with entity linking, it is currently unclear which techniques and systems are the state of the art, as these systems differ along multiple dimensions and are evaluated over different data sets. A single entity linking system typically performs very differently on different data sets and domains. Although the supervised ranking methods seem to perform much better than the unsupervised approaches with respect to candidate entity ranking, the overall performance of an entity linking system is also significantly influenced by the techniques adopted in the other two modules (i.e., Candidate Entity Generation and Unlinkable Mention Prediction). Supervised techniques require many annotated training examples, and annotating examples is costly. Furthermore, the entity linking task is highly data dependent, and it is unlikely that one technique dominates all others across all data sets. For a given entity linking task, it is difficult to determine which techniques are best suited. Many aspects affect the design of an entity linking system, such as the system requirements and the characteristics of the data sets. Although our survey has presented many
  • 12. efforts in entity linking, we believe that there are still many opportunities for substantial improvement in this field. In the following, we point out some promising research directions in entity linking.
REFERENCES
  [1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives, “DBpedia: A nucleus for a web of open data,” in Proc. 6th Int. Semantic Web 2nd Asian Conf. Asian Semantic Web Conf., 2007, pp. 11–15.
  [2] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of semantic knowledge unifying wordnet and wikipedia,” in Proc. 16th Int. Conf. World Wide Web, 2007, pp. 697–706.
  [3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1247–1250.
  [4] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, “Web-scale information extraction in knowitall: (preliminary results),” in Proc. 13th Int. Conf. World Wide Web, 2004, pp. 100–110.
  [5] A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka, Jr., and T. M. Mitchell, “Coupled semi-supervised learning for information extraction,” in Proc. 3rd ACM Int. Conf. Web Search Data Mining, 2010, pp. 101–110.
  • 13. [6] W. Wu, H. Li, H. Wang, and K. Q. Zhu, “Probase: A probabilistic taxonomy for text understanding,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2012, pp. 481–492.
  [7] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Sci. Am., vol. 284, pp. 34–43, 2001.
  [8] E. Agichtein and L. Gravano, “Snowball: Extracting relations from large plain-text collections,” in Proc. ACM Int. Conf. Digital Libraries, 2000, pp. 85–94.
  [9] D. Zelenko, C. Aone, and A. Richardella, “Kernel methods for relation extraction,” J. Mach. Learn. Res., vol. 3, pp. 1083–1106, 2003.
  [10] T. Hasegawa, S. Sekine, and R. Grishman, “Discovering relations among named entities from large corpora,” in Proc. 42nd Ann. Meeting Assoc. Comput. Linguistics, 2004, pp. 415–422.
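ILLUSTRATIVE SKETCH: NAME DICTIONARY CONSTRUCTION
The name dictionary D described under Name Dictionary Based Techniques can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the implementation of any surveyed system: the function names (build_name_dictionary, generate_candidates), the input shapes, and the entity identifiers are hypothetical, the toy data reuses only the examples mentioned in the text, and candidates are looked up by exact string match, which is the simplest form of the string comparison mentioned for candidate entity generation.

```python
from collections import defaultdict


def build_name_dictionary(entity_titles, redirects, disambiguations,
                          bold_phrases, anchors):
    """Build D: name k -> set of entities (k.value).

    All input shapes are assumptions made for this sketch:
      entity_titles   : iterable of entity page titles
      redirects       : dict {redirect title: pointed entity}
      disambiguations : dict {ambiguous name: list of entities}
      bold_phrases    : dict {entity: bold phrases in its first paragraph}
      anchors         : iterable of (anchor text, pointed entity) pairs
    """
    D = defaultdict(set)
    # Entity pages: the page title is a name for the entity it describes.
    for title in entity_titles:
        D[title].add(title)
    # Redirect pages: the redirect title is a name for the pointed entity.
    for alias, target in redirects.items():
        D[alias].add(target)
    # Disambiguation pages: one name maps to every listed entity.
    for name, entities in disambiguations.items():
        D[name].update(entities)
    # Bold phrases: each phrase is a name for the page's entity.
    for entity, phrases in bold_phrases.items():
        for phrase in phrases:
            D[phrase].add(entity)
    # Hyperlinks: the anchor text is a name for the linked entity.
    for anchor_text, target in anchors:
        D[anchor_text].add(target)
    return D


def generate_candidates(D, mention):
    """Return the candidate entity set E_m for a mention (exact match only)."""
    return set(D.get(mention, set()))


# Toy data drawn from the examples in the text; identifiers are illustrative.
D = build_name_dictionary(
    entity_titles=["Microsoft", "Hewlett-Packard", "William Reddington Hewlett"],
    redirects={"Microsoft Corporation": "Microsoft"},
    disambiguations={
        "Michael Jordan": [
            "Michael Jordan (NBA player)",
            "Michael Jordan (Berkeley professor)",
        ]
    },
    bold_phrases={"Hewlett-Packard": ["Hewlett-Packard Company", "HP"]},
    anchors=[("Bill Hewlett", "William Reddington Hewlett")],
)

for mention in ["HP", "Michael Jordan", "Bill Hewlett", "Sun"]:
    print(mention, "->", sorted(generate_candidates(D, mention)))
```

A mention that has no entry in D, such as “Sun” in this toy dictionary, yields an empty candidate set. A real system would typically add partial or fuzzy matching and surface form expansion before concluding that no candidate exists.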