MINING NAMED ENTITIES FROM
WIKIPEDIA
GROUP MEMBERS
- NIKHIL BAROTE
- KUNJ THAKKAR
- SHIVANI PODDAR
- ANKIT SHARMA
• In many search domains, both content and searches are
frequently tied to named entities such as a person or a
company.
• One challenge from an information retrieval point of view is
that a single entity can be referred to in more than one
way.
• In this project we describe how to use Wikipedia content
to automatically generate a dictionary of named entities
and synonyms that all refer to the same entity.
• With our approach, we can find named entities and their
synonyms with a high degree of accuracy.
• There are four Wikipedia features that are particularly
attractive as a mining source when building a large
collection of named entities (NEs):
1. INTERNAL LINKS
2. REDIRECT LINKS
3. EXTERNAL LINKS
4. CATEGORIES
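As a rough illustration of how three of these features can be read out of raw article markup, the sketch below parses standard MediaWiki wikitext: `[[Target|caption]]` for internal links, `#REDIRECT [[Target]]` for redirects, and `[[Category:Name]]` for categories. The function name `parse_wikitext` and the return shape are assumptions made for this example and are not taken from the slides; external links are left out.

```python
import re

# [[Target]] or [[Target|caption]]; an optional #section anchor is ignored.
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")
# A redirect page starts with "#REDIRECT [[Target]]".
REDIRECT = re.compile(r"^\s*#REDIRECT\s*\[\[([^\[\]|#]+)", re.IGNORECASE)
CATEGORY_PREFIX = "Category:"


def parse_wikitext(text):
    """Return (redirect_target, internal_links, categories) for one entry.

    internal_links is a list of (target, caption) pairs; when a link has no
    explicit caption, the target itself is used as the caption.
    """
    match = REDIRECT.match(text)
    if match:
        # A redirect page carries no further content of interest here.
        return match.group(1).strip(), [], []

    links, categories = [], []
    for target, caption in INTERNAL_LINK.findall(text):
        target = target.strip()
        if target.startswith(CATEGORY_PREFIX):
            categories.append(target[len(CATEGORY_PREFIX):])
        else:
            links.append((target, caption.strip() or target))
    return None, links, categories


sample = ("[[International Business Machines|IBM]] was founded in "
          "[[New York]]. [[Category:Companies]]")
redirect, links, categories = parse_wikitext(sample)
```

The (target, caption) pairs are the raw material for synonym extraction, since each caption is one way of referring to the target entry.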
• Generic Named Entity Recognition
Generic named entity recognition only classifies a Wikipedia entry
as an entity or not. It starts by looking at the title of the entry, since
most article titles are nouns, and the only nouns
we are interested in are proper nouns.
• Category-Based Named Entity Recognition
This is a subtask of information extraction that seeks to locate and classify
elements in text into pre-defined categories such as the names of persons,
organizations, locations, expressions of time, quantities, monetary values,
percentages, etc.
• Synonym Extraction
After a set of NEs has been identified, we want to find their synonyms.
We intend to use the internal links, redirects and disambiguation pages
for this, and we can easily extract all of these once we have the NEs.
This gives us a list of captions, all used on links to a particular entity.
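A minimal sketch of this aggregation step, assuming the redirects and link captions have already been extracted into simple Python structures. The function `build_synonyms`, its argument shapes and the sample data are hypothetical and only illustrate the idea; disambiguation pages are not handled here.

```python
from collections import defaultdict


def build_synonyms(entities, redirects, link_captions):
    """Collect every alternative name that points at a known entity.

    entities:      set of titles already classified as named entities
    redirects:     dict mapping a redirect title to its target title
    link_captions: iterable of (target_title, caption) pairs from internal links
    """
    synonyms = defaultdict(set)
    # A redirect title is an alternative name for its target.
    for source, target in redirects.items():
        if target in entities:
            synonyms[target].add(source)
    # A link caption that differs from the target title is also a synonym.
    for target, caption in link_captions:
        if target in entities and caption != target:
            synonyms[target].add(caption)
    return dict(synonyms)


entities = {"International Business Machines"}
redirects = {
    "IBM": "International Business Machines",
    "Big Blue": "International Business Machines",
    "Color blue": "Blue",  # target is not an entity, so it is ignored
}
captions = [
    ("International Business Machines", "IBM Corp."),
    ("Blue", "blue"),
]
synonym_dict = build_synonyms(entities, redirects, captions)
```

Using a set per entity keeps repeated captions from being stored more than once; a counter could be used instead if caption frequency matters.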
• Generic Named Entity Recognition Algorithm
To classify the entries we implemented an algorithm using the
following steps when given a title, T, and the text of an entry:
1. Remove any domain suffix from T
2. Tokenize T into n units, W = w_1, w_2, ..., w_n
3. Remove any w_i from W where w_i is included in S
4. Classify as an entity if any of these conditions holds
true:
• ∑ C(w_i) = n and n ≥ 2
• ∑ D(w_i) ≥ 2
• ∑ E(T)/N(T) ≥ α
• A domain suffix is the text enclosed in parentheses that follows
the title of entries with multiple senses.
• Domain suffixes are used to disambiguate between the senses, but
since they are not part of the entity name, we
must first strip them from the title. Next we strip all w_i
that are found in S, which is a list of stop words.
Here l_i denotes the i-th letter of a token w:
1. C(w) = 1 if any l_i ∈ [A..Z], 0 otherwise
2. D(w) = 1 if Q ≥ 2, where Q = ∑ C(l_i), 0 otherwise
C is a function that returns 1 if the parameter is capitalized,
and 0 otherwise, while D is a function that returns 1 if the
parameter has multiple capital letters, and 0 otherwise. α is a
variable used as a threshold for the third condition.
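The classification steps and the functions C and D can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the stop word list is a small hypothetical stand-in for S, "capitalized" is read as "the first letter is in A..Z", and because the slides do not define E(T) and N(T), the sketch takes their ratio as an optional precomputed argument `ratio`. The default `alpha=0.5` is likewise a made-up value.

```python
import re

# Hypothetical stand-in for the stop word list S.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def strip_domain_suffix(title):
    """Remove a trailing parenthesised suffix, e.g. 'Mercury (planet)' -> 'Mercury'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)


def C(word):
    """1 if the word is capitalized (first letter in A..Z), 0 otherwise."""
    return 1 if word and "A" <= word[0] <= "Z" else 0


def D(word):
    """1 if the word contains two or more capital letters, 0 otherwise."""
    capitals = sum(1 for letter in word if "A" <= letter <= "Z")
    return 1 if capitals >= 2 else 0


def is_entity(title, ratio=None, alpha=0.5):
    """Apply the three conditions to a title.

    ratio stands for E(T)/N(T), which the caller must compute from the
    entry text; when it is None the third condition is skipped.
    """
    words = [w for w in strip_domain_suffix(title).split()
             if w.lower() not in STOP_WORDS]
    n = len(words)
    if n >= 2 and sum(C(w) for w in words) == n:  # every remaining word capitalized
        return True
    if sum(D(w) for w in words) >= 2:             # two or more multi-capital words
        return True
    return ratio is not None and ratio >= alpha   # third condition
```

For example, `is_entity("University of Oslo")` passes the first condition once the stop word "of" is removed, while a single-word title such as "Mercury (planet)" can only pass through the third condition.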
Search System
• First we take unigrams, bigrams and trigrams from our query
document.
• We look them up in our synonym database and get a
list of doc_titles and corresponding doc_ids.
• Next we collect the words in a window centered at the current
word and look at the candidate documents and their doc_ids
(the window size is set beforehand).
• We use the vector space model to match our query document
to these candidates.
• We pick the candidates with a score greater than a preset
threshold. Finally, we look up the category for these entities in our
database.
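One possible reading of this pipeline is sketched below. Several details are assumptions, since the slides do not specify them: the synonym database is modelled as a dict from a lowercased surface form to a list of (doc_id, doc_title) pairs, the window words form the query-side vector, term weights are raw term frequencies, and matching uses cosine similarity. The names `link_entities`, `synonym_db` and `doc_texts` are hypothetical, and the final category lookup is omitted.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield (start_index, text) for every unigram, bigram and trigram."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, " ".join(tokens[i:i + n])


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_entities(query_tokens, synonym_db, doc_texts, window=5, threshold=0.2):
    """Return (ngram, doc_id, doc_title, score) for candidates above threshold."""
    results = []
    for i, gram in ngrams(query_tokens):
        n = gram.count(" ") + 1
        # Context: up to `window` tokens on each side of the n-gram.
        lo, hi = max(0, i - window), i + n + window
        context = Counter(t.lower() for t in query_tokens[lo:hi])
        for doc_id, title in synonym_db.get(gram.lower(), []):
            doc_vector = Counter(t.lower() for t in doc_texts[doc_id])
            score = cosine(context, doc_vector)
            if score > threshold:
                results.append((gram, doc_id, title, score))
    return results


query = "Apple released a new phone with a faster chip".split()
synonym_db = {"apple": [(1, "Apple Inc."), (2, "Apple (fruit)")]}
doc_texts = {
    1: "apple inc. designs phone and chip hardware".split(),
    2: "apple fruit grows on trees and tastes sweet".split(),
}
matches = link_entities(query, synonym_db, doc_texts, window=8, threshold=0.2)
```

In this toy data, the context words "phone" and "chip" overlap with document 1 only, so the company entry scores higher than the fruit entry. A real system would more likely use TF-IDF weights than raw counts.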
Information_retrieval_and_extraction_IIIT
• Zesch et al. evaluate the usefulness of Wikipedia as a lexical
semantic resource and compare it to more traditional
resources, such as dictionaries, thesauri, semantic wordnets, etc.
• Bunescu and Paşca study how to use Wikipedia for detecting
and disambiguating NEs in open-domain text.
• R. C. Bunescu and M. Paşca. Using encyclopedic knowledge for
named entity disambiguation. In Proceedings of
EACL 2006, 2006.
• R. Schenkel, F. M. Suchanek, and G. Kasneci. YAWN: A semantically
annotated Wikipedia XML corpus. In Proceedings of
BTW 2007, 2007.
• T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and
accessing Wikipedia as a lexical semantic resource. In
Proceedings of the Biannual Conference of the Society for
Computational Linguistics and Language Technology, 2007.
• R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison Wesley, 1999.
THANK YOU!
