Named Entity Recognition
Sobha Lalitha Devi
AU-KBC Research Centre
Chennai
Named Entity(NE) Recognition
• What is NE and What is not an NE
• How to identify NE
• Tagset and Annotation Guidelines
• Methods Used in developing NER
03/04/25 2
IIIT Summer School
Why do NER?
• A key part of any Information Extraction (IE) system
• Robust handling of proper names is essential for
many applications, such as summarization, information
retrieval (IR) and anaphora resolution
• Pre-processing for different classification
levels
• Information filtering
• Information linking
What is NER?
• NER involves identification of proper names in
texts, and classification into a set of predefined
categories of interest.
• Three universally accepted categories:
• Person, location and organisation
• Other common tasks: recognition of date/time
expressions, measures (percent, money, weight
etc), email addresses etc.
• Other domain-specific entities: names of Drugs,
Genes, medical conditions, names of ships,
bibliographic references etc.
NER Definition
• Named entity recognition (NER), also known as entity
identification (EI) or entity extraction, is the task of locating
and classifying atomic elements in text into predefined categories
such as the names of persons, organizations and locations,
expressions of time, quantities, monetary values, percentages,
etc.
John sold 5 companies in 2002.
<ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX
TYPE="QUANTITY">5</NUMEX> companies in <TIMEX
TYPE="DATE">2002</TIMEX>.
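The tagged sentence above can be read back mechanically. The following is a minimal Python sketch, assuming flat (non-nested) elements with a single TYPE attribute, exactly as in this example; the function names `extract_entities` and `strip_tags` are illustrative and not part of any standard tool.

```python
import re

# Matches one flat MUC-style element: ENAMEX, NUMEX or TIMEX with a TYPE attribute.
# Assumption: no nesting and no attributes other than TYPE.
TAG_RE = re.compile(r'<(ENAMEX|NUMEX|TIMEX)\s+TYPE="([^"]+)">(.*?)</\1>')

def extract_entities(tagged):
    """Return (text, element, type) triples for each tagged element."""
    return [(m.group(3), m.group(1), m.group(2)) for m in TAG_RE.finditer(tagged)]

def strip_tags(tagged):
    """Recover the plain sentence by replacing each element with its text."""
    return TAG_RE.sub(lambda m: m.group(3), tagged)

tagged = ('<ENAMEX TYPE="PERSON">John</ENAMEX> sold '
          '<NUMEX TYPE="QUANTITY">5</NUMEX> companies in '
          '<TIMEX TYPE="DATE">2002</TIMEX>.')
entities = extract_entities(tagged)
plain = strip_tags(tagged)
```

Here `entities` holds one triple per tagged element and `plain` is the original untagged sentence, which makes it easy to check that annotation has not altered the text.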
What is not NER?
• NER is not event recognition.
• NER does not create templates.
• NER does not perform co-reference or entity linking,
– though these processes are often implemented alongside
NER as part of a larger IE system.
• NER is not just matching text strings against predefined
lists of names.
– It recognises entities only where they are used as entities
in the given context.
• NER is not an easy task!
Named Entity and Philosophy of
Language
• Proper names are characterised by two theories
– Descriptivist theory of names
• Frege, Russell, Ludwig Wittgenstein and John Searle
– Causal theory of reference
• Saul Kripke
Descriptivist theory of names
Proper names either are synonymous with descriptions, or have their
reference determined by virtue of the name's being associated with
a description or cluster of descriptions that an object uniquely
satisfies.
Causal theory of reference
Proper names refer to an object by virtue of a causal connection
with the object, as mediated through communities of speakers. That
is, proper names, in contrast to descriptions, are rigid designators.
Rigid designators: a proper name refers to the named object in every
possible world in which the object exists.
Descriptions: a description may designate different objects in different
possible worlds.
Proper Names and Definite Descriptions
• On the descriptivist view, the meaning of a sentence involving a
proper name can be given by substituting a contextually appropriate
description for the name.
e.g. Otto von Bismarck can be known or described as the first
Chancellor of the German Empire
• Kripke argues that definite descriptions are generally not rigid
designators, because a description need not pick out the same
object in all possible worlds.
• For more on Kripke's account of proper names, see Naming and Necessity (1980).
What is a Named Entity?
• Named Entities are
– Noun phrases
– Rigid designators: a rigid designator denotes the same
thing in every possible world in which that thing exists,
and denotes nothing else in the possible worlds in which
it does not exist
Examples: Not a Named Entity vs. Named Entity
• Hotel vs. Taj Hotel
• Flower vs. Rose Flower
• Beach vs. Kovalam Beach
• Airport vs. Indira Gandhi International Airport
• The School vs. Good Shepherd School
• Prime Minister vs. Mr. Manmohan Singh
Some problems in identifying NEs
• Variation of NEs.
– Manmohan Singh, Manmohan, Dr. Manmohan
Singh
• Ambiguity of NE types:
– 1945 (date vs. time)
– Washington (location vs. person)
– May (person vs. month)
– Tata (person vs. organization)
Ambiguity Examples
• Person vs Location
– Sir C. P. Ramaswamy was the Divan of Travancore
(Per)
– Sir C. P. Ramaswamy Road is in Chennai (Loc)
• Person vs Organization
– Anil Ambani opened Reliance Fresh (Per)
– Reliance Fresh is under Anil Ambani Group Ltd
(Org)
More complex problems in NER
Issues of style, structure, domain, genre etc.
– Punctuation, spelling, spacing and formatting all have an
impact, for example when a line break splits a name or an address:
Dept. of Computing and Information Science
Manchester Metropolitan University
Manchester
United Kingdom
> Tell me more about Leonardo
> Da Vinci
Problems in NE Task Definition
• Category definitions are intuitively quite clear,
but there are many grey areas.
• Many of these grey areas are caused by
metonymy:
Person vs. Artefact
Organisation vs. Location
Company vs. Artefact
Location vs. Organisation
Tagset for Named Entity
• The ACE (Automatic Content Extraction) tagset is hierarchical
• The CLIA tagset
– is hierarchical, similar to ACE
– was developed for two domains
• Tourism and Health
TAGSET
• ENAMEX
– Person
• Individual
– Family name
– Title
• Group
– Organization
• Government
• Public/private company
• Religious
• Non-government
– Political Party
– Para military
– Charitable
– Association
• GPE (Geo-political Social Entity)
• Media
– Location
• Place
– District
– City
– State
– Nation
– Continent
• Address
• Water-bodies
• Landscapes
• Celestial Bodies
– Manmade
» Religious Places
» Roads/Highways
» Museum
» Theme parks/Parks/Gardens
» Monuments
• Facilities
– Hospitals
• Institutes
• Library
– Hotel/Restaurants/Lodges
– Plant/Factories
– Police Station/Fire Services
– Public Comfort Stations
– Airports
– Ports
– Bus-Stations
• Locomotives
• Artifacts
– Implements
– Ammunition
– Paintings
– Sculptures
– Clothes
– Gems & Stones
• Entertainment
– Dance
– Music
– Drama/Cinema
– Sports
– Events/Exhibitions/Conferences
• Cuisines
• Animals
• Plants
Tagset Continued
• NUMEX
– Distance
– Money
– Quantity
– Count
• TIMEX
– Time
– Date
– Day
– Period
Tagset Counts
First-level tags: 3
Second-level tags: 43
Third-level tags: 40
Total: 86
How to Annotate
• 1. ENAMEX
– 1.1 Person
• 1.1.1 Individual
• This refers to the name of an individual person, including the names of
fictional characters found in stories, novels, etc.
Tag Structure:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"> abc </ENAMEX>
Examples:
English:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul Kalam</ENAMEX>
Annotation continued
1.1.1.1 Family name
In general, a person's name includes a family name.
Whenever an individual's name occurs with a family name, the
part of the name that refers to the family name must be tagged
specifically with the subtag "FAMILYNAME", as shown below.
Tag Structure:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"
SUBTYPE_2="FAMILYNAME"> abc </ENAMEX>
Examples:
English:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu
Prasad <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"
SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>
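Because the family-name element sits inside the individual element, a flat pattern match is not enough to read this annotation. The sketch below assumes the attribute quotes are plain ASCII double quotes, so that the annotation is well-formed XML, and uses Python's standard `xml.etree.ElementTree`; the `<s>` wrapper element and the helper `iter_entities` are illustrative choices, not part of the guidelines.

```python
import xml.etree.ElementTree as ET

def iter_entities(elem, depth=0):
    """Yield (depth, attributes, full text) for every ENAMEX element, outermost first."""
    for child in elem:
        if child.tag == "ENAMEX":
            yield depth, dict(child.attrib), "".join(child.itertext())
            yield from iter_entities(child, depth + 1)

annotated = ('<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu Prasad '
             '<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" '
             'SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>')

# Wrap the sentence in a dummy root so it parses as a single XML document.
root = ET.fromstring("<s>" + annotated + "</s>")
found = list(iter_entities(root))
```

The outer element yields the full name at depth 0 and the inner element yields only the family name at depth 1, mirroring the SUBTYPE_2 convention described above.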
NE Types
The named entity hierarchy is divided into three major classes: entity
names (ENAMEX), numerical expressions (NUMEX) and time expressions (TIMEX).
Entity Types
Entity Name Types
• Persons are entities limited to humans. A person may be a single
individual or a group. Individual refers to the name of an individual person;
Group refers to a set of individuals.
• Location entities are limited to geographical entities, such as
countries, cities, continents and landmasses, bodies of
water, and geological formations.
• Organization entities are limited to corporations, agencies, and other
groups of people defined by an established organizational structure.
Examples for Entity Name Types
En: [Sita]PERSON is working at [HCL]ORGANIZATION, which is in [Chennai]LOCATION
Ta: [Seetha]PERSON [chennaiyilrukkira]LOCATION [HCLlil]ORGANIZATION velaiseikirAl.
(gloss: Sita Chennai HCL working)
Ml: [Seetha]PERSON [chennaiyillula]LOCATION [HCLlil]ORGANIZATION jolicheyyunnu.
(gloss: Sita Chennai HCL working)
Hi: [Seetha]PERSON [HCL]ORGANIZATION main kaam kar raha hai, jo [chennai]LOCATION main hain.
(gloss: Sita HCL work is which Chennai in)
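The bracketed examples can be turned into per-token labels for training a sequence tagger. The sketch below is one way to do this, assuming flat (non-nested) `[text]LABEL` annotations with upper-case labels; the B-/I-/O encoding is a common convention brought in here for illustration and is not prescribed by the slides, and `to_bio` is a hypothetical helper name.

```python
import re

# One unit of the slide notation: either "[text]LABEL" (optional space before
# the upper-case label) or a plain whitespace-delimited word.
CHUNK_RE = re.compile(r'\[([^\[\]]+)\]\s*([A-Z]+)|(\S+)')

def to_bio(annotated):
    """Convert flat '[text]LABEL' notation into (token, BIO tag) pairs."""
    pairs = []
    for m in CHUNK_RE.finditer(annotated):
        if m.group(1) is not None:
            words = m.group(1).split()
            label = m.group(2)
            pairs.append((words[0], "B-" + label))          # first word of the entity
            pairs.extend((w, "I-" + label) for w in words[1:])  # remaining words
        else:
            pairs.append((m.group(3), "O"))                 # outside any entity
    return pairs

sentence = "[Sita]PERSON is working at [HCL]ORGANIZATION , which is in [Chennai] LOCATION"
bio = to_bio(sentence)
```

The same function works unchanged on the romanized Tamil, Malayalam and Hindi lines, since it relies only on the bracket notation and not on the language.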
Facilities are limited to buildings and other permanent man-made structures
and real estate improvements, such as hospitals, airports, colleges and libraries.
En: [Apollo Hospital]FACILITY is in [Chennai]LOCATION
Ta: [Appallo maruthuvamanAi]FACILITY [Chennaiyil]LOCATION irukkirathu
Ml: [Appolo Asupathri]FACILITY [chennaiyil]LOCATION aaN
Hi: [Appolo aspathaal]FACILITY [chennai]LOCATION mein haim.
A locomotive entity is a physical device primarily designed to move an object
from one location to another by carrying, pulling, or pushing the transported
object.
En: [Ananthapuri Express]LOCOMOTIVE departs from [Chennai]LOCATION at [7.30 pm]TIME.
Hi: [Ananthapuri express]LOCOMOTIVE [Chennai]LOCATION se [rAth 7.30]TIME ko ravana hoga
Ml: [Ananthapuri eksprass]LOCOMOTIVE [chennaiyilninn]LOCATION [raathri 7.30 maNikk]TIME puRappetum.
Ta: [Ananthapuri viraivu rayil]LOCOMOTIVE [chennaiyilirunthu]LOCATION [iRavu 7.30 maNikku]TIME puRappatukirathu
Artifact entities are objects or things produced or shaped by human craft,
such as tools, weapons/ammunition, paintings, clothes, ornaments and medicines.
En: [Vinayaga Statue]ARTIFACT is looking beautiful
Ta: [Vinayakarin Silai]ARTIFACT pArpatharkku alakAkAkairukkirathu
Ml: [ganapathi vigraham]ARTIFACT baMgiyaayi irikkunnu.
Hi: [Vinayaka moorthi]ARTIFACT achi lagh rahi haim.
Entertainment entities denote activities that divert and hold human
attention or interest and give pleasure, happiness or amusement, especially
performances of some kind such as dance, music, sports and events.
En: [Flower Exhibition]ENTERTAINMENT is held at [Hyderabad]LOCATION
Ta: [Malar kankAtchi]ENTERTAINMENT [hyderabaadil]LOCATION Nadaiperukirathu
Ml: [pushpa pradarshanam]ENTERTAINMENT [hyderabaadil]LOCATION natakkunnu
Hi: [phool pradarshnii]ENTERTAINMENT [hyderabad]LOCATION meN Ayojith kiyaa jAthA hai
Materials refer to the names of food items, cuisines, chemicals and
cosmetics.
En: [Honey]MATERIALS is good for face
Ta: [ThEn]MATERIALS mukaththiRku nallathu
Ml: [Madhu]MATERIALS mukaththinu nallathAN
Hi: [Shahad]MATERIALS chehare ke liye achcha hai.
Organisms: these are the names of living things, including animals, birds,
reptiles, viruses and bacteria, as well as herbs, medicinal plants, shrubs,
trees, fruits, flowers, etc.
En: [Peacock]ORGANISM is the national bird of [India]LOCATION
Ta: [Mayil]ORGANISM [InthiyAvin]LOCATION thEciyappaRavai Akum.
Ml: [Mayil]ORGANISM [indyayute]LOCATION raashtrapakshi AN.
Hi: [Mor]ORGANISM [bhaarath]LOCATION kaa raashtrIya pakshi hai.
Disease: names of diseases, symptoms, diagnoses and treatments come
under this type.
En: Smoking causes [Cancer]DISEASE
Ta: PukaippithithalAl [puRRuNoi]DISEASE varukiRathu
Ml: pukavali [aRbhudham]DISEASE uNtAkkunnu
Hi: dhumrapan [kaansar]DISEASE ka kaaraN banaatha hai.
Numerical Expressions
NUMEX: DISTANCE, QUANTITY, COUNT, MONEY
• Distance refers to distance measures such as kilometers, centimeters,
meters and feet; area units such as acres and hectares are also tagged here.
Example: 10 cm., twenty feet, 15 hectares
• Money specifies currency values in units such as rupee, euro, dinar and
dollar.
Example: Rs. 1000, 250 Euro, $160
• Count denotes the number (or count) of items, articles, things, etc.
Example: 5 subjects, 12 students, 20 books
• Quantity covers measurements in units such as liters, tons, grams and
volts.
Example: 20 litres, 22 kg, 50g, 100 volts
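The four NUMEX subtypes can be approximated with a few ordered patterns. The sketch below is a toy classifier built only from the units named on this slide; the unit lists, the rule order and the name `classify_numex` are assumptions made for illustration. It handles digit-based expressions only, so a spelled-out number such as "twenty feet" is deliberately left unclassified.

```python
import re

NUM = r'\d+(?:\.\d+)?'
DISTANCE_UNITS = r'(?:km|kilometers?|cm|centimeters?|meters?|feet|acres?|hectares?)'
QUANTITY_UNITS = r'(?:litres?|liters?|tons?|kg|grams?|g|volts?)'

# Ordered rules: the first pattern that matches the whole expression wins.
# COUNT is last because "number + any word" would otherwise swallow the others.
NUMEX_RULES = [
    ("MONEY", re.compile(rf'(?:Rs\.?\s*|\$){NUM}|{NUM}\s*(?:Euro|rupees?|dollars?|Dinar)', re.I)),
    ("DISTANCE", re.compile(rf'{NUM}\s*{DISTANCE_UNITS}\.?', re.I)),
    ("QUANTITY", re.compile(rf'{NUM}\s*{QUANTITY_UNITS}', re.I)),
    ("COUNT", re.compile(rf'{NUM}\s+[A-Za-z]+')),
]

def classify_numex(expr):
    """Return the NUMEX subtype of a digit-based expression, or None."""
    expr = expr.strip()
    for label, pattern in NUMEX_RULES:
        if pattern.fullmatch(expr):
            return label
    return None

examples = ["10 cm.", "15 hectares", "Rs. 1000", "250 Euro", "$160",
            "12 students", "20 litres", "50g", "twenty feet"]
labels = {e: classify_numex(e) for e in examples}
```

A rule list like this is easy to read but brittle: every new unit or currency has to be added by hand, which is one motivation for the statistical methods discussed later in the deck.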
Time Expressions
TIMEX: TIME, DATE, MONTH, YEAR, DAY, SPECIAL DAY, PERIOD
Temporal expressions are entities that refer to time, date, year, month and day.
• Time: expressions of time in their different forms, including hours,
minutes and seconds.
Example
5 o'clock in the morning
9.30 a.m.
Evening 6.30 p.m.
• Date: expressions of date in their different forms, such as 13/12/2001,
including month, date and year.
Example
August 15 1947
1956
September 11
• Day: expressions that convey days in a year, including days that
recur weekly, fortnightly, monthly, quarterly, biennially, etc.
Example
Sunday
Tomorrow
Today
Yesterday
• Special Day: refers to special days in a year.
Example
Gandhi Jayanthi
Rama Navami
• Period: expressions that convey a duration of time, a time
period or a time interval.
Example
17th century
10 minutes
10 a.m. to 12 p.m.
One year
Methodologies
Methods:
1) Rule Based
2) Machine Learning
– Hidden Markov Model (HMM)
– Naïve Bayes Classifier
– Maximum Entropy Markov Model (MEMM)
– Conditional Random Fields (CRF)
3) Hybrid Approach
Challenges of NER in Indian Languages
The major challenges encountered in Indian languages are:
• Agglutination
• Ambiguity
– between proper and common nouns
– between named entity types
• Lack of capitalization
Agglutination
In Dravidian languages, words consist of a lexical root to which one or more
affixes are attached.
Example in Tamil:
1) Ta: Ramanaiththavira
(other than Raman)
2) Ta: Cevvaiyandru
(on Tuesday)
3) Ta: Inthiyavilllula
(in India)
4) Ta: KannanaippaRRikkondu
(holding onto Kannan)
Example in Malayalam:
1) Ml: hemayiluNtaayirunna
(that which Hema had)
2) Ml: Chennaiyilethunna
(reaching Chennai)
3) Ml: arabikatalinaBimukhamaayi
(towards the Arabian Sea)
4) Ml: kaaSiyilekkozhukunna
(flowing towards kaaSi)
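Because affixes attach to the end of the root, an exact dictionary match on the surface word fails for all of the examples above. The sketch below illustrates the problem with a deliberately naive workaround: it looks for the longest gazetteer entry that the word starts with. The gazetteer contents and the name `find_root` are hypothetical, and a real system would more plausibly use a morphological analyser, since prefix matching over-generates on unrelated words that happen to share a prefix.

```python
def find_root(token, gazetteer):
    """Return the longest gazetteer entry that `token` starts with, or None."""
    lowered = token.lower()
    matches = [name for name in gazetteer if lowered.startswith(name.lower())]
    return max(matches, key=len) if matches else None

# Hypothetical gazetteer of romanized roots taken from the slide examples.
GAZETTEER = {
    "Raman": "PERSON",
    "Kannan": "PERSON",
    "Hema": "PERSON",
    "Inthiya": "LOCATION",
    "Chennai": "LOCATION",
}

tokens = ["Ramanaiththavira", "Inthiyavilllula", "hemayiluNtaayirunna", "velaiseikirAl"]
roots = {t: find_root(t, GAZETTEER) for t in tokens}
```

The last token is an ordinary verb form and correctly finds no root; the first three recover the name hidden inside the inflected word.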
Ambiguity
Indian languages suffer comparatively more from ambiguity, both
between common and proper nouns and between named entity types.
In some cases the same word can refer to different named entity types. Such
instances can be resolved using contextual information.
Examples:
Hi: Akash – Person name and Sky
Hi: Sooraj – Person name and Sun
Hi: Chaanth – Moon and Silver
Hi: Aam – Mango and Common
Ml: Roopa – Person name and Rupee
Ml: Madhu – Person name and Honey
Ml: Mala – Person name and Garland
Ta: Thinkal – Day and Month
Ta: Malar – Person name and Flower
Ta: Chevvai – Day and Planet
Ta: Shakthi – Person name and Power
Ta: MAlai – Evening and Garland
Ta & Ml: Velli – Silver, Planet, Day
Spell Variation: because of different writing styles, the same entity is
represented by various word forms. In Tamil, Sanskrit letters
such as “ja”, “sha”, “sri” and “Ha” are replaced by forms such as “sa”, “ciri” and “ka”.
Example:
Roja - Rosa
Srimathi - cirimathi
Raja - rasa
ShajahAn - sajakAn
Lack of Capitalization
• In English and some other European languages, capitalization is an
important feature for identifying proper nouns, and it plays a major role
in NE identification.
• Unlike English, Indian languages have no concept of capitalization.
Nested Entities
Nested entities are named entities that occur within another
named entity. They are also called embedded entities.
Ta: [[Mathurai]LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE
En: Mathurai Meenatchi Amman Temple
Ml: [[Nittoor]PERSON Srinivasa rao]PERSON
En: Nittoor Srinivasa rao
Hi: [[Rajeev]PERSON MArg]ROAD
En: Rajeev Road
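Nested annotations like these can be read with a small stack-based parser. The sketch below assumes the bracket notation used on this slide, with an upper-case label immediately (or after a space) following each closing bracket; `parse_nested` is an illustrative name and spans are reported as word-index ranges.

```python
import re

# Three token kinds: an opening bracket, a closing bracket with its label,
# or an ordinary word containing no brackets or whitespace.
TOKEN_RE = re.compile(r'(\[)|\]\s*([A-Z]+)|([^\s\[\]]+)')

def parse_nested(annotated):
    """Return (words, spans); each span is (start, end, label) over word indices."""
    words, stack, spans = [], [], []
    for m in TOKEN_RE.finditer(annotated):
        if m.group(1):
            stack.append(len(words))      # remember where this entity starts
        elif m.group(2):
            start = stack.pop()           # innermost open entity closes first
            spans.append((start, len(words), m.group(2)))
        else:
            words.append(m.group(3))
    return words, spans

words, spans = parse_nested("[[Mathurai] LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE")
```

Inner entities are emitted before the entity that contains them, so a span is nested inside another whenever its index range falls within the other's range.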
Approaches to Named Entity Recognition
• Dictionary look-up
• Rule based (using lexical, contextual and morphological
information)
• Maximum entropy based
• Hidden Markov Model
• Conditional Random Fields
• Hybrid methods (statistical + linguistic)
Dictionary (Gazetteer) Look-up Approach
• Uses dictionaries (gazetteers) to identify NEs
• The gazetteer contains NEs from all domains
• Advantages
– Very simple approach
– Gives very high precision
Disadvantages of Dictionary Approach
• Preparation of exhaustive dictionary is a
tedious and expensive process.
• The dictionary should cover the different
spellings of the same place.
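A gazetteer look-up can be sketched as a greedy longest-match over token sequences. The example below is a minimal illustration under stated assumptions: the gazetteer entries, the labels attached to them and the function name `gazetteer_tag` are all made up for the example, and spelling variants are handled only by listing each variant explicitly, which is exactly the maintenance cost noted above.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match look-up; returns (start, end, label) spans."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in gazetteer:
                spans.append((i, i + n, gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans

# Hypothetical gazetteer; both spellings of the same place must be listed.
GAZETTEER = {
    "taj hotel": "FACILITY",
    "kovalam beach": "LOCATION",
    "madurai": "LOCATION",
    "mathurai": "LOCATION",
}

tokens = "The Taj Hotel is far from Kovalam Beach".split()
spans = gazetteer_tag(tokens, GAZETTEER)
```

The look-up is blind to context: it would tag a listed string the same way whether it is used as a person, a place or an ordinary word, which is why the earlier slide stresses that NER is more than matching against lists.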
Rule Based Approach
• Rule Based System
– Needs many rules to tag all kinds of NEs
• Advantages:
– Rich and expressive rules
– Good results
• Disadvantages:
– Requires considerable experience and grammatical knowledge
– Experts to craft rules are expensive
– Highly domain specific (not portable to a new domain)
General difficulties
"Italy's business world was rocked by the
announcement last Thursday that Mr. Verdi would
leave his job as vice-president of Music Masters of
Milan, Inc. to become operations director of Arthur
Andersen".
• Capitalization is useless for the first word
• The possessive 's is not part of the name "Italy"
• The date is "last Thursday", not "Thursday"
• Milan is a location, not an organization
• Arthur Andersen is an organization, not a person
Rules success and failure
Title Capitalized_Word → Title Person_Name
Correct: Mr. Jones
Incorrect: Mrs. Field's Cookies (corporation)
Month_name number_less_than_32 → Date
Correct: February 28
Incorrect: Long March 3 (a Chinese rocket)
From Date to Date → Date
Correct: from August 3 to August 9
Incorrect: I moved my trip from April to June (two
separate dates)
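The first two rules can be written as regular expressions to show both their hits and their false positives. This is a sketch only: the title list, the pattern details and the name `apply_rules` are assumptions for illustration, not the rules of any particular system.

```python
import re

MONTHS = (r'(?:January|February|March|April|May|June|July|August|'
          r'September|October|November|December)')

RULES = [
    # Title Capitalized_Word -> Person
    ("PERSON", re.compile(r'\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+')),
    # Month_name number_less_than_32 -> Date
    ("DATE", re.compile(rf'\b{MONTHS}\s+(?:[12]?\d|3[01])\b')),
]

def apply_rules(text):
    """Return (label, matched text) for every rule match, rule by rule."""
    return [(label, m.group(0)) for label, rx in RULES for m in rx.finditer(text)]

hits = {
    "Mr. Jones": apply_rules("Mr. Jones"),
    "Mrs. Field's Cookies": apply_rules("Mrs. Field's Cookies"),
    "February 28": apply_rules("February 28"),
    "Long March 3": apply_rules("Long March 3"),
}
```

The rules fire on "Mrs. Field's Cookies" and "Long March 3" just as readily as on the correct cases, because a surface pattern alone cannot see that the first is a company and the second a rocket.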
Statistical approach
• Features need to be identified
• Feature selection has to be appropriate for all
types of NE
• A tagged corpus has to be developed
• The corpus should contain all types of tags in
adequate numbers
• A domain-specific corpus has to be created
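To make "identifying features" concrete, the sketch below builds a feature dictionary for one token, in the form that sequence labellers such as CRFs typically consume. The particular features and the name `token_features` are illustrative assumptions, not the feature set of any system cited in this deck.

```python
def token_features(tokens, i, gazetteer=frozenset()):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "prefix3": word[:3],                  # leading characters, close to the root
        "suffix3": word[-3:],                 # trailing characters, where affixes attach
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "in_gazetteer": word.lower() in gazetteer,
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Sita", "works", "in", "Chennai"]
feats = token_features(tokens, 3, gazetteer={"chennai"})
```

Which features are useful depends on the language: a capitalization feature helps in English, but carries no signal in scripts without case, where prefix, suffix and context features have to do more of the work.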
Automated approaches
• Address the drawbacks of hand-coded systems
• Automated training
– Uses human-annotated training data (annotated to the desired
output standards)
– Annotation requires less effort and expertise
than hand-coding rules
• Annotation accuracy
– Two annotators for checking, a third annotator to
resolve disputes
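The two-annotators-plus-adjudicator workflow can be sketched as a merge step that consults the third annotator only where the first two disagree. The function name `adjudicate`, the `resolve` callback and the simple agreement ratio are assumptions made for illustration.

```python
def adjudicate(ann_a, ann_b, resolve):
    """Merge two label sequences; call `resolve(i, a, b)` only on disagreements."""
    if len(ann_a) != len(ann_b):
        raise ValueError("annotations must cover the same tokens")
    final, disputed = [], []
    for i, (a, b) in enumerate(zip(ann_a, ann_b)):
        if a == b:
            final.append(a)
        else:
            disputed.append(i)
            final.append(resolve(i, a, b))   # third annotator decides
    agreement = 1 - len(disputed) / len(ann_a) if ann_a else 1.0
    return final, disputed, agreement

annotator_1 = ["PERSON", "O", "LOCATION", "O"]
annotator_2 = ["PERSON", "O", "ORGANIZATION", "O"]
# Stand-in for the third annotator: here they always side with LOCATION.
final, disputed, agreement = adjudicate(annotator_1, annotator_2,
                                        lambda i, a, b: "LOCATION")
```

The raw agreement ratio returned here is only a rough indicator, since it does not correct for agreement that would occur by chance.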
Literature Survey
1) Named Entity Recognition was one of the tasks defined in the Sixth Message Understanding
Conference (MUC-6).
2) A survey of Named Entity Recognition was carried out by Nadeau and Sekine (2007).
3) Techniques used include:
- a rule based technique by Krupka (1998)
- maximum entropy by Borthwick et al. (1998)
- Hidden Markov Models by Bikel et al. (1997)
- a bootstrapping approach using concept based seeds (Niu et al., 2003)
- hybrid approaches, such as rule based tagging for entities like date,
time and percentage combined with a maximum entropy based approach for entities
like location and organization (Srihari et al., 2000)
4) The Stanford NER software (Finkel et al., 2005) uses linear chain CRFs in its
NER engine. It identifies three classes of NEs, viz. Person, Organization
and Location.
References
• Arulmozhi, P. and Sobha, L. (2006). HMM-based Part of Speech Tagger for Relatively Free Word Order Language. Advances in Natural Language Processing, Research in Computing Science Journal, Mexico, Volume 18, pp. 37-48.
• Bikel, D. M., Miller, S., Schwartz, R. and Weischedel, R. (1997). Nymble: A high-performance learning name-finder. In Fifth Conference on Applied Natural Language Processing, pp. 194-201.
• Borthwick, A., Sterling, J., Agichtein, E. and Grishman, R. (1998). Description of the MENE Named Entity System. In Seventh Message Understanding Conference (MUC-7).
• Chen, W., Zhang, Y. and Isahara, H. (2006). Chinese Named Entity Recognition with Conditional Random Fields. In Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, pp. 118-121.
• Ekbal, A. and Bandyopadhyay, S. (2009). A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1), pp. 1-44.
• Finkel, J. R., Grenager, T. and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
• Finkel, J., Dingare, S., Nguyen, H., Nissim, M., Sinclair, G. and Manning, C. (2004). Exploiting Context for Biomedical Entity Recognition: from Syntax to the Web. In Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland.
• Gali, K., Surana, H., Vaidya, A., Shishtla, P. and Sharma, D. M. (2008). Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition. In Workshop on NER for South and South East Asian Languages, IJCNLP-08, Hyderabad, India.
• Gupta, S. and Bhattacharyya, P. (2010). Think globally, apply locally: using distributional characteristics for Hindi named entity identification. In 2010 Named Entities Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA.
• Kumar, K. N., Santosh, G. S. K. and Varma, V. (2011). A Language-Independent Approach to Identify the Named Entities in under-resourced languages and Clustering Multilingual Documents. In International Conference on Multilingual and Multimodal Information Access Evaluation, University of Amsterdam, Netherlands.
• Lafferty, J., McCallum, A. and Pereira, F. (2001). Conditional Random Fields for segmenting and labeling sequence data. In ICML-01, pp. 282-289.
• Loinaz, I. A., Uriarte, O. A., Ramos, N. E. and Castro, M. I. F. D. (2006). Lessons from the Development of Named Entity Recognizer for Basque. Natural Language Processing, 36, pp. 25-37.
• McCallum, A. and Li, W. (2003). Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Seventh Conference on Natural Language Learning (CoNLL).
• Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Linguisticae Investigationes 30(1), pp. 3-26.
• Niu, C., Li, W., Ding, J. and Srihari, R. K. (2003). Bootstrapping for Named Entity Tagging using Concept-based Seeds. In HLT-NAACL'03, Companion Volume, Edmonton, AT, pp. 73-75.
• Pandian, S. Lakshmana, Geetha, T. V. and Krishna. (2007). Named Entity Recognition in Tamil using Context-cues and the E-M algorithm. In Proceedings of the 3rd Indian International Conference on Artificial Intelligence, Pune, India, pp. 1951-1958.
• Sasidhar, B., Yohan, P. M., Babu, V. A. and Govarhan, A. (2011). A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu. International Journal of Computer Science Issues, Volume 8, pp. 1694-0814.
• Sobha, L. and Vijay Sundar Ram, R. (2006). Noun Phrase Chunker for Tamil. In Proceedings of Symposium on Modeling and Shallow Parsing of Indian Languages, Indian Institute of Technology, Mumbai, pp. 194-198.
• Srihari, R. K., Niu, C. and Yu, L. (2000). A Hybrid Approach for Named Entity Recognition in Indian Languages. In 6th Applied Natural Language Conference, pp. 247-254.
• Vijayakrishna, R. and Sobha, L. (2008). Domain focused Named Entity for Tamil using Conditional Random Fields. In IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India, pp. 59-66.
Literature Survey
Indian Languages:
5) Named entity recognition for Hindi, Bengali, Oriya, Telugu and Urdu (some of the major
Indian languages) was addressed as a shared task in the NERSSEAL workshop of IJCNLP-08. The
tagset used there consisted of 12 tags.
6) Vijayakrishna & Sobha (2008) worked on a domain focused Tamil named entity recognizer for
the tourism domain using CRF. It handles nested tagging of named entities with a hierarchical tagset
containing 106 tags. They used the roots of words, POS tags, combined word and POS, and a dictionary of
named entities as features to build the system.
7) Pandian et al. (2007) built a Tamil NER system using contextual cues and the E-M algorithm.
8) The NER system of Gali et al. (2008), built for the NERSSEAL-2008 shared task, combines
machine learning techniques with language specific heuristics. The system was tested on five
languages (Telugu, Hindi, Bengali, Urdu and Oriya) using CRF followed by post-processing
that applies some heuristics.
Thank you
03/04/25 60
IIIT Summer School

More Related Content

PPT
782893827-7-NER-in-Details.ppt JNHJBHJJBGBGU
PPT
782893827-7-NER-in-Details.ppt JKHKH JDJ
PDF
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
PDF
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
PPTX
Reading Group 2013 (DERI NUIG)
PDF
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
PPTX
What is a named entity
PDF
Entity Linking
782893827-7-NER-in-Details.ppt JNHJBHJJBGBGU
782893827-7-NER-in-Details.ppt JKHKH JDJ
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
Reading Group 2013 (DERI NUIG)
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
What is a named entity
Entity Linking

Similar to sobha-ner.ppt named entity recognition model (20)

PDF
Domain Specific Named Entity Recognition Using Supervised Approach
PDF
Named Entity recognition in Sanskrit
PPTX
Named Entity Recognition - ACL 2011 Presentation
PDF
A study on the approaches of developing a named entity recognition tool
PDF
Mining named entities -IIITH
PPTX
Information retrieval and extraction
PDF
Information Extraction with Linked Data
PDF
A survey of named entity recognition in assamese and other indian languages
PDF
IE: Named Entity Recognition (NER)
PDF
Perspectives on mining knowledge graphs from text
PPT
Download
PPT
Download
PDF
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
PDF
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
PDF
Session 1.2 high-precision, context-free entity linking exploiting unambigu...
PDF
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
PDF
Named Entity Recognition Using Web Document Corpus
PDF
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION
PPTX
Introduction to Named Entity Recognition
PDF
D017422528
Domain Specific Named Entity Recognition Using Supervised Approach
Named Entity recognition in Sanskrit
Named Entity Recognition - ACL 2011 Presentation
A study on the approaches of developing a named entity recognition tool
Mining named entities -IIITH
Information retrieval and extraction
Information Extraction with Linked Data
A survey of named entity recognition in assamese and other indian languages
IE: Named Entity Recognition (NER)
Perspectives on mining knowledge graphs from text
Download
Download
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
Session 1.2 high-precision, context-free entity linking exploiting unambigu...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
Named Entity Recognition Using Web Document Corpus
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION
Introduction to Named Entity Recognition
D017422528
Ad

Recently uploaded (20)

PDF
AI-Powered Threat Modeling: The Future of Cybersecurity by Arun Kumar Elengov...
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
Why Generative AI is the Future of Content, Code & Creativity?
PDF
CCleaner Pro 6.38.11537 Crack Final Latest Version 2025
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
Wondershare Filmora 15 Crack With Activation Key [2025
PPTX
Advanced SystemCare Ultimate Crack + Portable (2025)
PDF
iTop VPN Crack Latest Version Full Key 2025
PPTX
Log360_SIEM_Solutions Overview PPT_Feb 2020.pptx
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PPTX
Computer Software and OS of computer science of grade 11.pptx
PDF
Salesforce Agentforce AI Implementation.pdf
PDF
Website Design Services for Small Businesses.pdf
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
Digital Systems & Binary Numbers (comprehensive )
PPTX
Embracing Complexity in Serverless! GOTO Serverless Bengaluru
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
iTop VPN Free 5.6.0.5262 Crack latest version 2025
PDF
Design an Analysis of Algorithms II-SECS-1021-03
AI-Powered Threat Modeling: The Future of Cybersecurity by Arun Kumar Elengov...
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
Design an Analysis of Algorithms I-SECS-1021-03
Why Generative AI is the Future of Content, Code & Creativity?
CCleaner Pro 6.38.11537 Crack Final Latest Version 2025
CHAPTER 2 - PM Management and IT Context
Wondershare Filmora 15 Crack With Activation Key [2025
Advanced SystemCare Ultimate Crack + Portable (2025)
iTop VPN Crack Latest Version Full Key 2025
Log360_SIEM_Solutions Overview PPT_Feb 2020.pptx
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Computer Software and OS of computer science of grade 11.pptx
Salesforce Agentforce AI Implementation.pdf
Website Design Services for Small Businesses.pdf
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Digital Systems & Binary Numbers (comprehensive )
Embracing Complexity in Serverless! GOTO Serverless Bengaluru
Odoo Companies in India – Driving Business Transformation.pdf
iTop VPN Free 5.6.0.5262 Crack latest version 2025
Design an Analysis of Algorithms II-SECS-1021-03
Ad

sobha-ner.ppt named entity recognition model

  • 1. Named Entity Recognition Sobha Lalitha Devi AU-KBC Research Centre Chennai
  • 2. Named Entity(NE) Recognition • What is NE and What is not an NE • How to identify NE • Tagset and Annotation Guidelines • Methods Used in developing NER 03/04/25 2 IIIT Summer School
  • 3. Why do NER? • Key part of Information Extraction system • Robust handling of proper names essential for many applications such as Summarization, IR, Anaphora,......... • Pre-processing for different classification levels • Information filtering • Information linking 03/04/25 3 IIIT Summer School
  • 4. What is NER ? • NER involves identification of proper names in texts, and classification into a set of predefined categories of interest. • Three universally accepted categories: • Person, location and organisation • Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc. • Other domain-specific entities: names of Drugs, Genes, medical conditions, names of ships, bibliographic references etc. 03/04/25 4 IIIT Summer School
  • 5. 03/04/25 IIIT Summer School 5 NER Definition • Named entity recognition (NER) (also known as entity identification (EI) and entity extraction) is the task that locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. John sold 5 companies in 2002. <ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
  • 6. What is not NER? • NER is not event recognition. • NER does not create templates. • NER does not perform co-reference or entity linking – though these processes are often implemented alongside NER as part of a larger IE system. • NER is not just matching text strings against predefined lists of names. It recognises entities which are being used as entities in a given context. • NER is not an easy task!
  • 7. Named Entity and Philosophy of Language • Proper names are defined by – Descriptivist theory of names • Frege, Russell, Ludwig Wittgenstein and John Searle – Causal theory of reference • Saul Kripke
  • 8. Descriptivist theory of names: Proper names either are synonymous with descriptions, or have their reference determined by virtue of the name's being associated with a description or cluster of descriptions that an object uniquely satisfies. Causal theory of reference: Proper names refer to an object by virtue of a causal connection with the object as mediated through communities of speakers. That is, proper names, in contrast to descriptions, are rigid designators. Rigid designators: a proper name refers to the named object in every possible world in which the object exists. Descriptions: a description may designate different objects in different possible worlds.
  • 9. Proper Names and Definite Descriptions • On the descriptivist view, a proper name in a sentence could be replaced by a contextually appropriate description without changing the meaning. E.g. Otto von Bismarck can be known or described as the first Chancellor of the German Empire. • Kripke argues that definite descriptions cannot be rigid designators, because a definite description need not pick out the same object in all possible worlds. • More on Kripke's account of proper names in Naming and Necessity (1980).
  • 10. What is a Named Entity • Named Entities are – Noun phrases – Rigid designators: a rigid designator denotes the same thing in all possible worlds in which that thing exists, and does not designate anything else in the possible worlds in which that thing does not exist
  • 11. Examples of a common noun (not a Named Entity) paired with a Named Entity • Hotel vs. Taj Hotel • Flower vs. Rose Flower • Beach vs. Kovalam Beach • Airport vs. Indira Gandhi International Airport • The School vs. Good Shepherd School • Prime Minister vs. Mr. Manmohan Singh
  • 12. Some problems in identifying NEs • Variation of NEs – Manmohan Singh, Manmohan, Dr. Manmohan Singh • Ambiguity of NE types – 1945 (date vs. time) – Washington (location vs. person) – May (person vs. month) – Tata (person vs. organization)
  • 13. Ambiguity Examples • Person vs. Location – Sir C. P. Ramaswamy was the Divan of Travancore (Per) – Sir C. P. Ramaswamy Road is in Chennai (Loc) • Person vs. Organization – Anil Ambani opened Reliance Fresh (Per) – Reliance Fresh is under Anil Ambani Group Ltd (Org)
  • 14. More complex problems in NER • Issues of style, structure, domain, genre etc. – Punctuation, spelling, spacing and formatting all have an impact • Example, multi-line address (line breaks shown as /): Dept. of Computing and Information Science / Manchester Metropolitan University / Manchester / United Kingdom • Example, name broken across quoted lines: > Tell me more about Leonardo / > Da Vinci
  • 15. Problems in NE Task Definition • Category definitions are intuitively quite clear, but there are many grey areas. • Many of these grey areas are caused by metonymy: Person vs. Artefact, Organisation vs. Location, Company vs. Artefact, Location vs. Organisation
  • 16. Tagset for Named Entity • The ACE tagset is hierarchical – ACE: Automatic Content Extraction • The CLIA tagset – Is hierarchical, similar to ACE – Developed for two domains • Tourism and Health
  • 17. TAGSET • ENAMEX – Person • Individual – Family name – Title • Group – Organization • Government • Public/private company • Religious • Non-government – Political Party – Para military – Charitable – Association • GPE (Geo-political Social Entity) • Media – Location • Place – District – City – State – Nation – Continent • Address • Water-bodies • Landscapes • Celestial Bodies – Manmade » Religious Places » Roads/Highways » Museum » Theme parks/Parks/Gardens » Monuments • Facilities – Hospitals • Institutes • Library – Hotel/Restaurants/Lodges – Plant/Factories – Police Station/Fire Services – Public Comfort Stations – Airports – Ports – Bus-Stations • Locomotives • Artifacts – Implements – Ammunition – Paintings – Sculptures – Clothes – Gems & Stones • Entertainment – Dance – Music – Drama/Cinema – Sports – Events/Exhibitions/Conferences • Cuisines • Animals • Plants
  • 18. Tagset Continued • NUMEX – Distance – Money – Quantity – Count • TIMEX – Time – Date – Day – Period • Tagset counts: first-level tags 3, second-level tags 43, third-level tags 40, total 86
  • 19. How to Annotate • 1. ENAMEX – 1.1 Person • 1.1.1 Individual • These refer to the names of individual persons, and also include the names of fictional characters found in stories, novels etc. Tag structure: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">abc</ENAMEX> Example (English): <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul Kalam</ENAMEX>
  • 20. Annotation continued • 1.1.1.1 Family name • A person name often includes a family name. Whenever an individual name occurs with a family name, the part of the name that refers to the family name must be tagged specifically with the subtag "FAMILYNAME", as in the tag structure that follows. Tag structure: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">abc</ENAMEX> Example (English): <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu Prasad <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>
  • 21. NE Types • The Named Entity hierarchy is divided into three major classes: entity names (ENAMEX), numerical expressions (NUMEX) and time expressions (TIMEX).
  • 23. Entity Name Types • Persons are entities limited to humans. A person may be a single individual or a group. Individual refers to the name of an individual person; Group refers to a set of individuals. • Location entities are limited to geographical entities such as countries, cities, continents and landmasses, bodies of water, and geological formations. • Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.
  • 24. Examples for Entity Name Types • En: [Sita]PERSON is working at [HCL]ORGANIZATION, which is in [Chennai]LOCATION • Ta: [Seetha]PERSON [chennaiyilrukkira]LOCATION [HCLlil]ORGANIZATION velaiseikirAl. (Gloss: Sita Chennai HCL working) • Ml: [Seetha]PERSON [chennaiyillula]LOCATION [HCLlil]ORGANIZATION jolicheyyunnu. (Gloss: Sita Chennai HCL working) • Hi: [Seetha]PERSON [HCL]ORGANIZATION main kaam kar raha hai, jo [chennai]LOCATION main hain. (Gloss: Sita HCL work is which Chennai in)
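Statistical taggers usually consume one label per token rather than bracketed spans. The sketch below converts the flat bracket notation used in these examples into BIO tags (B- for the first token of an entity, I- for later tokens, O for everything else). BIO encoding is an assumption here: it is a common convention, not something the slides prescribe, and the sketch assumes every bracket is immediately followed by an upper-case label and that spans are not nested.

```python
import re

# One bracketed span followed by its label, e.g. "[Sita]PERSON" or "[Chennai] LOCATION".
SPAN_RE = re.compile(r'\[([^\[\]]+)\]\s*([A-Z]+)')

def to_bio(annotated):
    """Convert flat bracket annotation into (token, BIO-tag) pairs."""
    pairs = []
    pos = 0
    for m in SPAN_RE.finditer(annotated):
        # Text between the previous span and this one lies outside any entity.
        pairs += [(tok, "O") for tok in annotated[pos:m.start()].split()]
        label = m.group(2)
        for i, tok in enumerate(m.group(1).split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = m.end()
    # Trailing text after the last span.
    pairs += [(tok, "O") for tok in annotated[pos:].split()]
    return pairs

example = "[Sita]PERSON is working at [HCL]ORGANIZATION , which is in [Chennai] LOCATION"
for token, tag in to_bio(example):
    print(token, tag)
```

Whitespace tokenization is used only to keep the sketch short; agglutinative languages would need a morphological analyser before tagging.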
  • 25. Entity Name Types • Facilities are limited to buildings and other permanent man-made structures and real estate improvements such as hospitals, airports, colleges and libraries. En: [Appolo Hospital]FACILITY is in [Chennai]LOCATION Ta: [Appallo maruthuvamanAi]FACILITY [Chennaiyil]LOCATION irukkirathu Ml: [Appolo Asupathri]FACILITY [chennaiyil]LOCATION aaN Hi: [Appolo aspathaal]FACILITY [chennai]LOCATION mein haim.
  • 26. Entity Name Types • A locomotive entity is a physical device primarily designed to move an object from one location to another by carrying, pulling, or pushing the transported object. En: [Ananthapuri Express]LOCOMOTIVE departs from [Chennai]LOCATION at [7.30pm]TIME. Hi: [Ananthapuri express]LOCOMOTIVE [Chennai]LOCATION se [rAth 7.30]TIME ko ravana hoga Ml: [Ananthapuri eksprass]LOCOMOTIVE [chennaiyilninn]LOCATION [raathri 7.30 maNikk]TIME puRappetum. Ta: [Ananthapuri viraivu rayil]LOCOMOTIVE [chennaiyilirunthu]LOCATION [iRavu 7.30 maNikku]TIME puRappatukirathu
  • 27. Artifact entities are objects or things, produced or shaped by human craft, such as tools, weapons/ammunition, art paintings, clothes, ornaments, medicines En: [Vinayaga Statue]ARTIFACT is looking beautiful Ta: [Vinayakarin Silai]ARTIFACT pArpatharkku alakAkAkairukkirathu Ml: [ganapathi vigraham]ARTIFACT baMgiyaayi irikkunnu. Hi: [Vinayaka moorthi]ARTIFACT achi lagh rahi haim. Entity Name Types 03/04/25 27 IIIT Summer School
  • 28. Entertainment entities denote activities, which are diverting and hold human attention or interest, giving pleasure, happiness, amusement especially performance of some kind such as dance, music, sports, events. En: [Flower Exhibition]ENTERTAINMENT is held at [Hyderabad]LOCATION Ta: [Malar kankAtchi]ENTERTAINMENT [hyderabaadil]LOCATION Nadaiperukirathu Ml: [pushpa pradarshanam] ENTERTAINMENT [hyderabaadil] LOCATION natakkunnu Hi: [phool pradarshnii] ENTERTAINMENT [hyderabad] LOCATION meN Ayojith kiyaa jAthA hai Entity Name Types 03/04/25 28 IIIT Summer School
  • 29. Materials refer to the names of food items, cuisines, chemicals and cosmetics En: [Honey]MATERIALS is good for face Ta: [ThEn]MATERIALS mukaththiRku nallathu Ml: [Madhu] MATERIALS mukaththinu nallathAN Hi: [Shahad]MATERIALS chehare ke liye achcha hai. Entity Name Types 03/04/25 29 IIIT Summer School
  • 30. ORGANISMS: These are the names of different animal species including birds, reptiles, viruses, bacteria and names of herbs, medicinal plants, shrubs, trees, fruits, flowers etc. En: [Peacock] ORGANISM is the national bird of [India] LOCATION Ta: [Mayil] ORGANISM [InthiyAvin] LOCATION thEciyappaRavai Akum. Ml: [Mayil] ORGANISM [indyayute] LOCATION raashtrapakshi AN. Hi: [Mor]ORGANISM [bhaarath] LOCATION kaa raashtrIya pakshi hai. Entity Name Types 03/04/25 30 IIIT Summer School
  • 31. Entity Name Types • Disease: names of diseases, symptoms, diagnoses and treatments come under this type. En: Smoking causes [Cancer]DISEASE Ta: PukaippithithalAl [puRRuNoi]DISEASE varukiRathu Ml: pukavali [aRbhudham]DISEASE uNtAkkunnu Hi: dhumrapan [kaansar]DISEASE ka kaaraN banaatha hai.
  • 33. Numerical Expressions • Distance refers to distance measures such as kilometers, centimeters, meters, acres, feet etc. Example: 10 cm, twenty feet, 15 hectares • Money specifies currency values in rupees, euros, dinars, dollars etc. Example: Rs. 1000, 250 Euro, $160 • Count denotes the number (or count) of items, articles, things etc. Example: 5 subjects, 12 students, 20 books • Quantity covers measurements such as litres, tons, grams, volts etc. Example: 20 litres, 22 kg, 50 g, 100 volts
  • 34. Time Expressions • TIMEX subtypes: Month, Date, Time, Year, Period, Day, Special Day
  • 35. Temporal Expressions • Temporal expressions are entities that refer to time, date, year, month and day. • Time: expressions of time in their different forms, including hours, minutes and seconds. Examples: 5 o'clock in the morning; 9.30 a.m.; Evening 6.30 p.m. • Date: expressions of date in different forms, such as 13/12/2001, including month, date and year. Examples: August 15 1947; 1956; September 11
  • 36. Day: These are expressions, which convey days in a year. Also it can include days occurring weekly /fortnightly/ monthly /quarterly/ biennial etc. Example Sunday Tomorrow Today Yesterday Special Day: refers to special days in a year Example Gandhi Jayanthi Rama Navami Temporal Expressions 03/04/25 36 IIIT Summer School
  • 37. Temporal Expressions • Period: expressions which express a duration of time, a time period or a time interval. Examples: 17th century; 10 minutes; 10 a.m. to 12 p.m.; One year
  • 38. Methodologies • Methods: 1) Rule based 2) Machine learning: Hidden Markov Model (HMM), Naïve Bayes classifier, Maximum Entropy Markov Model (MEMM), Conditional Random Fields (CRF) 3) Hybrid approach
  • 39. Challenges of NER in Indian Languages • The following are the major challenges encountered in Indian languages: – Agglutination – Ambiguity • Between proper and common nouns • Between named entities – Lack of capitalization
  • 40. Challenges of NER in Indian Languages • Agglutination: in Dravidian languages, words consist of a lexical root to which one or more affixes are attached. Examples in Tamil: 1) Ta: Ramanaiththavira (other than Raman) 2) Ta: Cevvaiyandru (on Tuesday) 3) Ta: Inthiyavilllula (in India) 4) Ta: KannanaippaRRikkondu (hold onto Kannan)
  • 41. Example in Malayalam: 1) Ml: hemayiluNtaayirunna (that which Hema have) 2) Ml: Chennaiyilethunna (reach in Chennai) 3) Ml: arabikatalinaBimukhamaayi (towards the arabian sea) 4) Ml: kaaSiyilekkozhukunna ( flowing towards kaaSi) Challenges of NER in Indian Languages 03/04/25 41 IIIT Summer School
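Because affixes attach directly to the name, an exact dictionary match on the surface word fails for forms such as Chennaiyilethunna. A deliberately naive sketch of one workaround, matching the longest known root that prefixes the token, is given here. The root list `ROOTS` and the function `match_root` are hypothetical, and real systems would rely on a morphological analyser, since prefix matching over-generates and ignores sound changes at the root boundary.

```python
# Hypothetical mini-list of root forms; a real list would be far larger.
ROOTS = {"chennai": "LOCATION", "hema": "PERSON", "raman": "PERSON"}

def match_root(token, roots=ROOTS):
    """Return (root, label, remainder) for the longest root that prefixes the token.

    Comparison is case-insensitive; the remainder keeps the original casing
    because case is meaningful in the transliteration used on the slides.
    Returns None when no root matches.
    """
    lowered = token.lower()
    best = None
    for root, label in roots.items():
        if lowered.startswith(root) and (best is None or len(root) > len(best[0])):
            best = (root, label, token[len(root):])
    return best

for word in ["Chennaiyilethunna", "hemayiluNtaayirunna", "Ramanaiththavira", "velaiseikirAl"]:
    print(word, match_root(word))
```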
  • 42. Challenges of NER in Indian Languages • Ambiguity: Indian languages suffer comparatively more from ambiguity, both between common and proper nouns and among named entity types themselves. In some cases the same word can refer to different named entity types. Such instances can be resolved using contextual information. Examples: Hi: Akash – Person name and Sky Hi: Sooraj – Person name and Sun Hi: Chaanth – Moon and Silver Hi: Aam – Mango and Common Ml: Roopa – Person name and Rupee Ml: Madhu – Person name and Honey Ml: Mala – Person name and Garland
  • 43. Challenges of NER in Indian Languages • Ta: Thinkal – Day and Month • Ta: Malar – Person name and Flower • Ta: Chevvai – Day and Planet • Ta: Shakthi – Person name and Power • Ta: MAlai – Evening and Garland • Ta & Ml: Velli – Silver, Planet, Day
  • 44. Challenges of NER in Indian Languages • Spelling variation: because of different writing styles, the same entity is represented by various word forms. In Tamil, Sanskrit letters such as “ja”, “sha”, “sri” and “Ha” are replaced by “sa”, “ciri” and “ka”. Examples: Roja can be written as Rosa; Srimathi as cirimathi; Raja as rasa; ShajahAn as sajakAn
  • 45. Challenges of NER in Indian Languages • Lack of capitalization – In English and some other European languages, capitalization is an important feature for identifying proper nouns. – It plays a major role in NE identification. – Unlike English, Indian languages have no concept of capitalization.
  • 46. Nested Entities • Nested entities are named entities which occur within another named entity. They are also called embedded entities. Ta: [[Mathurai]LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE En: Mathurai Meenatchi Amman Temple Ml: [[Nittoor]PERSON Srinivasa rao]PERSON En: Nitoor Srinivasa rao Hi: [[Rajeev]PERSON MArg]ROAD En: Rajeev Road
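One way to hold nested entities in memory is as labelled token spans, where nesting is simply containment of one span inside another. The sketch below encodes the Tamil temple example this way; the `Span` class and the helper names are illustrative assumptions, not a format defined by the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # index of the first token
    end: int     # index one past the last token
    label: str

def contains(outer, inner):
    """True when inner lies inside outer and the two spans are not identical."""
    same = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same

def nested_pairs(spans):
    """All (outer, inner) pairs where inner is embedded in outer."""
    return [(o, i) for o in spans for i in spans if contains(o, i)]

tokens = ["Mathurai", "MeenAtchi", "Amman", "Koyil"]
spans = [
    Span(0, 4, "RELPLACE"),   # the whole temple name
    Span(0, 1, "LOCATION"),   # Mathurai
    Span(1, 3, "PERSON"),     # MeenAtchi Amman
]

for outer, inner in nested_pairs(spans):
    print(outer.label, "contains", inner.label, "=", " ".join(tokens[inner.start:inner.end]))
```

A flat one-tag-per-token encoding cannot represent this example directly, which is one reason nested tagging needs either layered tag sequences or a span-based representation.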
  • 47. 03/04/25 IIIT Summer School 47 Approaches in Named Entity Resolution • Dictionary Look-up • Rule based ( Using lexical, contextual and morphological information) • Maximum entropy theory based • Hidden Markov Model • Conditional Random Fields • Hybrid methods (Statistical+ Linguistics)
  • 48. Dictionary (Gazetteer) Look-up Approach • Uses dictionaries (gazetteers) for identifying NEs • The gazetteer contains NEs from all domains • Advantages – Very simple approach – Gives very high precision
  • 49. 03/04/25 IIIT Summer School 49 Disadvantages of Dictionary Approach • Preparation of exhaustive dictionary is a tedious and expensive process. • The dictionary should cover the different spellings of the same place.
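A minimal sketch of the look-up approach, assuming a tiny hand-made gazetteer and greedy longest-match over whitespace tokens. The entries and labels in `GAZETTEER` are hypothetical; note that two spellings of the same place have to be listed separately, which is the coverage cost mentioned above.

```python
# Hypothetical gazetteer: lower-cased token tuples mapped to a label.
GAZETTEER = {
    ("taj", "hotel"): "FACILITY",
    ("kovalam", "beach"): "LOCATION",
    ("chennai",): "LOCATION",
    ("mathurai",): "LOCATION",
    ("madurai",): "LOCATION",   # alternative spelling needs its own entry
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def lookup(tokens, gazetteer=GAZETTEER, max_len=MAX_LEN):
    """Greedy longest-match look-up; returns (start, end, label) token spans."""
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "Taj Hotel" wins over shorter entries.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                found.append((i, i + n, gazetteer[key]))
                i += n
                break
        else:
            i += 1   # no entry starts at this token
    return found

print(lookup("We stayed at Taj Hotel near Kovalam Beach".split()))
print(lookup("The hotel is in Chennai".split()))
```

The common noun "hotel" in the second sentence is left untagged because only the full name is listed, which is where the high precision of the approach comes from; anything missing from the list is silently missed.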
  • 50. 03/04/25 IIIT Summer School 50 Rule Based Approach • Rule Based System – Needs more rules to tag all kinds of NE • Advantages: – Rich and expressive rules – Good results • Disadvantages: – Requires huge experience and grammatical knowledge – Experts to craft rules are expensive – Highly domain specific ( not portable to a new domain)
  • 51. General difficulties "Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc. to become operations director of Arthur Andersen". • Capitalization is useless for the first word • 's is not part of the name "Italy" • The date is "last Thursday", not "Thursday" • Milan is a location, not an organization • Arthur Andersen is an organization, not a person
  • 52. Rules: success and failure • Title Capitalized_Word → Title Person_Name – Correct: Mr. Jones – Incorrect: Mrs. Field's Cookies (corporation) • Month_name number_less_than_32 → Date – Correct: February 28 – Incorrect: Long March 3 (a Chinese rocket) • from Date to Date → Date – Correct: from August 3 to August 9 – Incorrect: I moved my trip from April to June (two separate dates)
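The first two rules on this slide can be written as regular expressions to make the failure cases concrete. This is a sketch under stated assumptions: the title list and the exact patterns are illustrative choices, not the rules of any particular system.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Title Capitalized_Word -> Person_Name (the title list is an assumption)
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+")

# Month_name number_less_than_32 -> Date
DATE_RULE = re.compile(rf"\b(?:{MONTHS})\s+(?:3[01]|[12]\d|[1-9])\b")

def apply_rule(rule, text):
    """Return every substring the rule fires on."""
    return [m.group(0) for m in rule.finditer(text)]

# The rules fire correctly here ...
print(apply_rule(PERSON_RULE, "Mr. Jones arrived."))
print(apply_rule(DATE_RULE, "The meeting is on February 28."))

# ... and fire wrongly here: a company name and a rocket name.
print(apply_rule(PERSON_RULE, "Mrs. Field's Cookies opened a store."))
print(apply_rule(DATE_RULE, "Long March 3 was launched."))
```

Both false positives come from the rules looking only at the matched words; telling a person from a company, or a date from a rocket, needs the surrounding context.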
  • 53. Statistical based approach • Need to identify features • Feature selection has to be correct for all types of NE • Development of Tagged Corpus • The Corpus should contain all types of tags in appropriate number • Domain based corpus has to be generated. 03/04/25 53 IIIT Summer School
  • 54. Automated approaches Address drawbacks of hand-coded system Automated training • Human-annotated (with desired output standards) training data • Annotation requires less effort and expertise than hand-coding rules • Annotation accuracy • Two annotators for checking, third annotator to resolve disputes 03/04/25 54 IIIT Summer School
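The two-annotator check with a third annotator resolving disputes can be sketched as a token-level merge. The function names and the example tags are hypothetical; the sketch assumes all three annotators tagged the same token sequence.

```python
def adjudicate(first, second, third):
    """Merge two annotators' tag sequences; the third annotator decides disagreements."""
    if not (len(first) == len(second) == len(third)):
        raise ValueError("all annotators must tag the same tokens")
    return [a if a == b else c for a, b, c in zip(first, second, third)]

def agreement_rate(first, second):
    """Fraction of tokens on which the two primary annotators agree."""
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(first, second)) / len(first)

# Tokens: Washington visited Chennai
annotator_1 = ["PERSON", "O", "LOCATION"]
annotator_2 = ["LOCATION", "O", "LOCATION"]
annotator_3 = ["PERSON", "O", "LOCATION"]   # consulted only where 1 and 2 differ

print(adjudicate(annotator_1, annotator_2, annotator_3))
print(agreement_rate(annotator_1, annotator_2))
```

Raw agreement is shown only because it is the simplest measure; chance-corrected measures are generally preferred when reporting annotation quality.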
  • 55. Literature Survey 1) Named Entity Recognition was one of the tasks defined in the sixth Message Understanding Conference (MUC-6). 2) A survey on Named Entity Recognition was done by Nadeau and Sekine (2007). 3) Techniques used include: - rule based technique by Krupka (1998) - maximum entropy by Borthwick et al. (1998) - Hidden Markov Model by Bikel et al. (1997) - bootstrapping approach using concept based seeds (Niu et al., 2003) - hybrid approaches such as rule based tagging for certain entities such as date, time and percentage, and a maximum entropy based approach for entities like location and organization (Srihari et al., 2000) 4) The Stanford NER software (Finkel et al., 2005) uses linear chain CRFs in its NER engine. It identifies three classes of NEs, viz. Person, Organization and Location.
  • 56. References • Arulmozhi, P. and Sobha, L. (2006). HMM-based Part of Speech Tagger for Relatively Free Word Order Language. Advances in Natural Language Processing, Research in Computing Science Journal, Mexico, Volume 18, pp. 37-48. • Bikel, D. M. Miller, S. Schwartz, R. Weischedel, R. (1997). Nymble: A high-performance learning name-finder. In Fifth Conference on Applied Natural Language Processing. pp. 194-201. • Borthwick, A. Sterling, J. Agichtein, E. and Grishman, R. (1998). Description of the MENE Named Entity System. In Seventh Message Understanding Conference (MUC-7). • Chen, W. Zhang, Y. and Isahara, H. (2006). Chinese Named Entity Recognition with Conditional Random Fields. In Fifth SIGHAN Workshop on Chinese Language Processing, Sydney. pp. 118-121. • Ekbal, A. Bandyopadhyay, S. (2009). A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1). pp. 1-44.
  • 57. References • Finkel, J. N. Grenager, T. and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005). pp. 363-370. • Finkel, J. Dingare, S. Nguyen, H. Nissim, M. Sinclair, G. and Manning, C. (2004). Exploiting Context for Biomedical Entity Recognition: from Syntax to the Web. In Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland. • Gali, K. Surana, H. Vaidya, A. Shishtla, P. Sharma, D. M. (2008). Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition. In Workshop on NER for South and South East Asian Languages, IJCNLP-08, Hyderabad, India. • Kumar, K. N. Santosh, G. S. K. Varma, V. (2011). A Language-Independent Approach to Identify the Named Entities in under-resourced languages and Clustering Multilingual Documents. In International Conference on Multilingual and Multimodal Information Access Evaluation, University of Amsterdam, Netherlands. • Lafferty, J. McCallum, A. Pereira, F. (2001). Conditional Random Fields for segmenting and labeling sequence data. In ICML-01, pp. 282-289. • Loinaz, I.A. Uriarte, O. A. Ramos, N. E. Castro, M. I. F. D (2006). Lessons from the Development of Named Entity Recognizer for Basque. Natural Language Processing, 36. pp. 25-37. • McCallum, A. and Li, W. (2003). Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Seventh Conference on Natural Language Learning (CoNLL).
  • 58. References • Nadeau, David and Sekine, S. (2007). A survey of named entity recognition and classification. Linguisticae Investigationes 30(1). pp. 3-26. • Niu, C. Li, W. Ding, J. Srihari, R. K. (2003). Bootstrapping for Named Entity Tagging using Concept-based Seeds. In HLT-NAACL'03, Companion Volume, Edmonton, AT. pp. 73-75. • Pandian, S. Lakshmana, Geetha, T. V. and Krishna. (2007). Named Entity Recognition in Tamil using Context-cues and the E-M algorithm. In Proceedings of the 3rd Indian International Conference on Artificial Intelligence, Pune, India. pp. 1951-1958. • Sasidhar, B., Yohan, P.M., Babu, V.A., Govarhan, A. (2011). A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu. International Journal of Computer Science Issues, Volume 8, pp. 1694-0814. • Sobha, L., Vijay Sundar Ram. R. (2006). "Noun Phrase Chunker for Tamil". In Proceedings of Symposium on Modeling and Shallow Parsing of Indian Languages, Indian Institute of Technology, Mumbai, pp. 194-198. • Srihari, R.K. Niu, C. Yu, L. (2000). A Hybrid Approach for Named Entity Recognition in Indian Languages. In 6th Applied Natural Language Conference, pp. 247-254. • Gupta, S. and Bhattacharyya, P. (2010). Think globally, apply locally: using distributional characteristics for Hindi named entity identification. In 2010 Named Entities Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA. • Vijayakrishna, R. and Sobha, L. (2008). Domain focused Named Entity for Tamil using Conditional Random Fields. In IJCNLP-08 workshop on NER for South and South East Asian Languages, Hyderabad, India. pp. 59-66.
  • 59. Literature Survey, Indian Languages: 5) Named Entity Recognition for Hindi, Bengali, Oriya, Telugu and Urdu (some of the major Indian languages) was addressed as a shared task in the NERSSEAL workshop of IJCNLP. The tagset used there consisted of 12 tags. 6) Vijayakrishna & Sobha (2008) worked on a domain focused Tamil Named Entity Recognizer for the tourism domain using CRF. It handles nested tagging of named entities with a hierarchical tagset containing 106 tags. They used the root of words, POS, combined word and POS, and a dictionary of named entities as features to build the system. 7) Pandian et al. (2007) built a Tamil NER system using contextual cues and the E-M algorithm. 8) The NER system of Gali et al. (2008), built for the NERSSEAL-2008 shared task, combines machine learning techniques with language specific heuristics. The system was tested on five languages (Telugu, Hindi, Bengali, Urdu and Oriya) using CRF followed by post processing which involves some heuristics.
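The four feature types listed in item 6 (root, POS, combined word and POS, dictionary membership) can be illustrated as a per-token feature dictionary, a shape commonly accepted by CRF toolkits. This is a sketch only, not the cited system: the input triples, the POS tag names and the function names are hypothetical, and the roots and POS tags are assumed to come from an upstream morphological analyser and tagger.

```python
def token_features(sentence, i, gazetteer):
    """Feature dict for token i; sentence is a list of (word, root, pos) triples."""
    word, root, pos = sentence[i]
    return {
        "root": root,                               # root of the word
        "pos": pos,                                 # part-of-speech tag
        "word+pos": f"{word}|{pos}",                # combined word and POS
        "in_gazetteer": root.lower() in gazetteer,  # dictionary of named entities
    }

def sentence_features(sentence, gazetteer):
    """One feature dict per token, in order."""
    return [token_features(sentence, i, gazetteer) for i in range(len(sentence))]

# Hypothetical analysed input: (surface word, root, POS tag).
sentence = [("Chennaiyil", "Chennai", "NNP"), ("irukkirathu", "iru", "VM")]
gazetteer = {"chennai"}

for features in sentence_features(sentence, gazetteer):
    print(features)
```

Looking the root up in the gazetteer, rather than the inflected surface form, is what lets a dictionary feature survive agglutination.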
  • 60. Thank you 03/04/25 60 IIIT Summer School