International Journal on Natural Language Computing (IJNLC) Vol. 2, No.1, February 2013
DOI: 10.5121/ijnlc.2013.2105
IDENTIFICATION AND CLASSIFICATION OF NAMED
ENTITIES IN INDIAN LANGUAGES
Sudha Morwal and Deepti Chopra
Department of Computer Science, Banasthali Vidyapith, Jaipur (Raj.), INDIA
sudha_morwal@yahoo.co.in
deeptichopra11@yahoo.co.in
ABSTRACT
Named Entity Recognition (NER) is the process of identifying the Named Entities (NEs) in a given
document and then classifying them into different NE categories. Considerable effort is needed to
perform NER in Indian languages with accuracy equal to or higher than that achieved for English and
the European languages. In this paper, we present the results we obtained by performing NER in
Hindi, Bengali and Telugu using a Hidden Markov Model (HMM), together with the Performance Metrics
used to evaluate them.
KEYWORDS
Accuracy, HMM, Named Entities, NER, Performance Metrics
1. INTRODUCTION
Named Entities (NEs) are proper nouns or other entities that represent the Name of a Person,
Location, Organisation, River, Percentage, Quantity, Time, etc. NER is used extensively in
Natural Language Processing. It is the task of categorising all the NEs or proper nouns in a
document into different NE classes, e.g. Person, Location, Organisation, City, State and
Country [1][2].
Consider an annotated sentence in Hindi:
“प्रधानमंत्री/OTHER अटल बिहारी वाजपेयी/PER ने/OTHER कहा/OTHER है/OTHER कि/OTHER
वित्तमंत्री/OTHER यशवंत सिन्हा/PER द्वारा/OTHER प्रस्तुत/OTHER बजट/OTHER विकास/OTHER
की/OTHER गति/OTHER को/OTHER तेज/OTHER करेगा/OTHER ।/OTHER”
(English gloss: “Prime Minister Atal Bihari Vajpayee has said that the budget presented by Finance
Minister Yashwant Sinha will accelerate the pace of development.”)
In the above sentence, the NER system identifies the NEs and classifies them into different
NE classes. Here, अटल बिहारी वाजपेयी and यशवंत सिन्हा are NEs that are names of persons, so
they are assigned the Name of Person tag (PER). The rest of the tokens are given the OTHER tag
since they are not Named Entities.
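The token/TAG annotation format shown above can be read back into (token, tag) pairs with a few lines of code. The following is only a minimal sketch: the helper names `parse_annotated` and `named_entities` are hypothetical, and the sample string is a romanized stand-in for part of the Hindi sentence, used here purely for illustration. It assumes tags are written in upper-case Latin letters and that a token may contain spaces (as multi-word person names do).

```python
import re

# Hypothetical helper, not part of the authors' system.
# A token may contain spaces, so we split on the "/TAG" marker, not on whitespace.
ANNOTATION = re.compile(r"\s*(.+?)/([A-Z]+)")

def parse_annotated(text):
    """Turn 'token/TAG token/TAG ...' into a list of (token, tag) pairs."""
    return [(token, tag) for token, tag in ANNOTATION.findall(text)]

def named_entities(pairs):
    """Keep only the pairs whose tag is not OTHER."""
    return [(token, tag) for token, tag in pairs if tag != "OTHER"]

# Romanized illustration of the start of the example sentence (an assumption).
sample = "pradhanmantri/OTHER Atal Bihari Vajpayee/PER ne/OTHER kaha/OTHER"
pairs = parse_annotated(sample)
print(named_entities(pairs))
```

Splitting on the tag marker rather than on spaces keeps a multi-word name together as one token, which matches how the person names are tagged in the example.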
Applications of NER include Information Retrieval, Information Extraction, Text Summarization,
Machine Translation and question answering systems. Although a lot of work has been accomplished
in NER for languages such as English, Chinese and Spanish, no significant amount of work has been
accomplished in NER for Indian languages (IL). There are several reasons for this. Firstly,
Indian languages have free word order and are inflectional and morphologically rich. Secondly,
unlike English, Indian languages have no concept of capitalisation, the cue that is commonly used to
detect NEs in a document. Thirdly, the web lacks resources in the Indian languages. Fourthly,
Indian languages contain NEs that also appear in the dictionary as common nouns, so the resulting
ambiguities need to be resolved.
In this paper, we discuss an NER system for Indian languages, in particular Hindi, Telugu
and Bengali, based on the Hidden Markov Model (HMM).
2. LITERATURE REVIEW
A phonetic matching technique based on the similar-sounding property has been developed to perform
NER in English and Hindi. The precision and average recall obtained are 81.40% and 81.3% for
English, and 80.2% and 54.97% for Hindi [4]. Annotated corpora of Bengali and Hindi containing
122,467 tokens and 502,974 tokens, respectively, have been used to perform NER with a Support
Vector Machine. Testing was done on 35K tokens of Bengali and 60K tokens of Hindi. The average
recall, precision and f-score values are 88.61%, 80.12% and 84.15% for Bengali, and 80.23%,
74.34% and 77.17% for Hindi [12]. NER has been performed in Telugu in two phases. In the first
phase, Telugu dictionaries, a Noun Morphological Stemmer and Noun Suffixes are used to identify
the nouns. In the second phase, transliterated gazetteer lists, various Named Entity suffix
features, context features and morphological features are used to identify the Named Entities.
The accuracy achieved using this approach is up to 95.37% [3]. The Maximum Entropy (ME) approach,
with contextual information and different orthographic word-level features, has also been used
for NER in Telugu. Contextual word features refer to the words preceding and following a given
word. The Telugu corpus was taken from Telugu Wikipedia and Telugu local dailies (iinaa Du). The
Precision, Recall and F-Measure obtained are 75.89%, 53.35% and 61.62% [2]. It has been shown
that, in a hybrid system, the HMM model performs better than the CRF model. The F-Measure values
obtained using the HMM model for Bengali, Hindi, Oriya, Telugu and Urdu are 39.77, 46.84, 45.84,
46.58 and 44.73 [14].
3. IMPLEMENTATION AND RESULTS
We have developed an NER system based on the Hidden Markov Model (HMM) approach (Figure 1).
The input to this system is raw text and its output is the Named Entity tags. The system works
in three phases. The first phase, the ‘Annotation phase’, produces tagged (annotated) text from
the raw text. The second phase, ‘Train HMM’, computes the three essential parameters of the HMM,
i.e. the Start Probability, the Emission Probability (B) and the Transition Probability (A)
[11][12][16]. The last phase is ‘TEST HMM’. In this phase, the user gives test sentences to the
system and, based on the HMM parameters computed in the previous phase, the Viterbi algorithm
computes the optimal state sequence for each test sentence.
Figure 1 NER in Indian languages using HMM
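The ‘TEST HMM’ phase can be illustrated with a small Viterbi decoder. This is a sketch under stated assumptions, not the system described in this paper: the function name `viterbi`, the toy probability tables and the romanized tokens are all hypothetical, and the small `floor` probability used for unseen words or transitions is an assumption, since the paper does not say how unseen events are handled.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for the token list `obs`."""
    if not obs:
        return []

    def lg(p):
        # Work in log space; `floor` stands in for unseen events (an assumption).
        return math.log(max(p, floor))

    def emit(s, word):
        return lg(emit_p.get(s, {}).get(word, 0.0))

    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (lg(start_p.get(s, 0.0)) + emit(s, obs[0]), None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r][0] + lg(trans_p.get(r, {}).get(s, 0.0)))
                 for r in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + emit(s, obs[t]), prev)
        best.append(row)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Toy, hand-set parameters (hypothetical values, for illustration only).
states = ["PER", "OTHER"]
start_p = {"PER": 0.3, "OTHER": 0.7}
trans_p = {"PER": {"PER": 0.1, "OTHER": 0.9}, "OTHER": {"PER": 0.2, "OTHER": 0.8}}
emit_p = {
    "PER": {"Vajpayee": 0.9, "ne": 0.05, "kaha": 0.05},
    "OTHER": {"Vajpayee": 0.02, "ne": 0.5, "kaha": 0.48},
}
print(viterbi(["Vajpayee", "ne", "kaha"], states, start_p, trans_p, emit_p))
# → ['PER', 'OTHER', 'OTHER']
```

Log probabilities are used so that long sentences do not underflow; the decoded sequence is the same as it would be with plain products.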
Mathematically, the HMM parameters are given as follows:
A = [a_ij], where a_ij = (number of transitions from state s_i to state s_j) / (number of
transitions from state s_i).
B = [b_j(k)], where b_j(k) = (number of times in state j and observing symbol k) / (expected
number of times in state j).
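With annotated training text, the ratios above reduce to relative-frequency counts. The sketch below shows one way the ‘Train HMM’ phase could be written; it is not the authors' code. The function name `train_hmm` and the toy romanized corpus are hypothetical, and the start probability is assumed to be the fraction of sentences that begin with a given tag, since the paper gives no explicit formula for it.

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Estimate start, transition (A) and emission (B) probabilities by counting."""
    start_c = Counter()
    trans_c = defaultdict(Counter)   # trans_c[s_i][s_j] = transitions s_i -> s_j
    emit_c = defaultdict(Counter)    # emit_c[s_j][k]   = times state s_j emits word k

    for sent in tagged_sentences:
        if not sent:
            continue
        start_c[sent[0][1]] += 1     # assumed definition of the start probability
        for i, (word, tag) in enumerate(sent):
            emit_c[tag][word] += 1
            if i + 1 < len(sent):
                trans_c[tag][sent[i + 1][1]] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    start_p = normalise(start_c)
    trans_p = {s: normalise(c) for s, c in trans_c.items()}
    emit_p = {s: normalise(c) for s, c in emit_c.items()}
    return start_p, trans_p, emit_p

# Toy annotated corpus (hypothetical, romanized for illustration).
corpus = [
    [("Vajpayee", "PER"), ("ne", "OTHER"), ("kaha", "OTHER")],
    [("Sinha", "PER"), ("ne", "OTHER"), ("kaha", "OTHER")],
    [("budget", "OTHER"), ("Sinha", "PER"), ("ne", "OTHER")],
]
start_p, trans_p, emit_p = train_hmm(corpus)
print(trans_p["OTHER"])   # OTHER -> OTHER twice, OTHER -> PER once
print(emit_p["OTHER"]["ne"])
```

In this toy corpus the tag OTHER is followed by OTHER twice and by PER once, so the estimated transition probabilities out of OTHER are 2/3 and 1/3.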
We performed NER in Hindi, Bengali and Telugu. The details of the corpora are given in TABLE 1.
For Hindi, we used a Tourism Domain Corpus developed at Banasthali Vidyapith. We also took
Hindi, Bengali and Telugu corpora from the NLTK Indian Corpora. The Hindi text from NLTK was
related to Politics and Sports; the Bengali and Telugu texts from NLTK were of general domain.
TABLE 1 Details of NER using HMM
S.NO. | LANGUAGE | SOURCE | DEVELOPED BY | WEBSITE | DOMAIN
1 | HINDI | Banasthali Vidyapith | - | - | Tourism
2 | HINDI | NLTK Indian Corpus | author: A Kumaran | http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml | Related to Politics and Sports News
3 | BENGALI | NLTK Indian Corpus | author: A Kumaran | http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml | General domain and Related to Countries, Locations, Animals, languages etc.
4 | TELUGU | NLTK Indian Corpus | author: A Kumaran | http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml | General
We also trained on 8,623 words (540 sentences) of Hindi from the NLTK Indian Corpora, using the
8 tags listed in TABLE 2. The Accuracy, Precision, Recall and F-Measure obtained are all 96%.
TABLE 2 Tags used in NER in Hindi sentences from NLTK Indian Corpora
SNO TAGS
1 PER (Name of Person)
2 LOC (Name of Location)
3 OTHER (Not a Named Entity)
4 CO (Name of Country)
5 MONTH
6 ORG (Name of Organization)
7 WEEK
8 PARTY (Name of Political Party)
We also performed NER in Hindi on the tourism domain. For training, we took 100 sentences
(2,332 tokens) from a Hindi tourism corpus developed at Banasthali Vidyapith and annotated them
using the 10 tags listed in TABLE 3. We obtained an F-Measure of 93%.
TABLE 3 Tags used in NER in Hindi sentences from Tourism domain
SNO TAGS
1 PER (Name of Person)
2 LOC (Name of Location)
3 OTHER (Not a Named Entity)
4 SPORT
5 TIME
6 MONTH
7 ORG (Name of Organization)
8 VEH (Name of Vehicle)
9 QTY (Name of Quantity)
10 RIVER
We also trained on 9,996 words (994 sentences) of Telugu from the NLTK Indian Corpora. The
F-Measure obtained is 98.6%. TABLE 4 shows the tags that we used.
TABLE 4 Tags used in NER in Telugu sentences from NLTK Indian Corpora
SNO TAGS
1 PER (Name of Person)
2 LOC (Name of Location)
3 OTHER (Not a Named Entity)
4 CO (Name of Country)
5 SUBJECT (Name of Subject)
6 LANGUAGE
Next, we trained on 10,303 words (899 sentences) of Bengali from the NLTK Indian Corpora. The
F-Measure obtained is 98.5%. The tags used are shown in TABLE 5.
TABLE 5 Tags used in NER in Bengali sentences from NLTK Indian Corpora
SNO TAGS
1 PER (Name of Person)
2 LOC (Name of Location)
3 OTHER (Not a Named Entity)
4 CO (Name of Country)
5 ANIMAL (Name of Animal)
6 RIGHT (Name of Fundamental Right)
7 DAIRY (Name of Dairy)
8 SONG
9 SPICES (Name of Spices)
10 MONTH
11 LANGUAGE
12 PANCHAYAT (Name of Panchayat)
The results we obtained, in terms of Precision, Recall and F-Measure, are given in TABLE 6.
TABLE 6 Results obtained by NER using HMM
SNO | LANGUAGE | PRECISION | RECALL | F-MEASURE
1 | Hindi (Tourism Corpus) | 93% | 93% | 93%
  | Hindi (General Domain) | 96% | 96% | 96%
2 | Bengali | 98.5% | 98.5% | 98.5%
3 | Telugu | 98.6% | 98.6% | 98.6%
4. PERFORMANCE METRICS
Performance Metrics are measures used to estimate the performance of an NER system. They can
be calculated in terms of three parameters: Precision, Recall and F-Measure [10] [6]. Consider
the following terms:
Response (R): the output of the NER system.
Answer Key (A): the human interpretation (annotation) of the same text.
Response-Answer Key (RA): the items that appear both in the output of the NER system and in the
human interpretation.
Hence, we define Precision, Recall and F-Measure as follows [2] [3] [9]:
Precision (P): RA/R
Recall (R): RA/A
F-Measure: (2 * P * R) / (P + R), where P is Precision and R is Recall
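The three definitions can be checked on a small example. In the sketch below, the function name `precision_recall_f` and the sample sets are hypothetical; each item is assumed to be a (token position, tag) pair for a named entity, so RA is simply the intersection of the Response and the Answer Key.

```python
def precision_recall_f(response, answer_key):
    """Compute Precision, Recall and F-Measure from two sets of (position, tag) items."""
    ra = len(response & answer_key)                      # items in both sets
    precision = ra / len(response) if response else 0.0
    recall = ra / len(answer_key) if answer_key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical example: the human annotator marked 4 entities, the system output 5.
answer_key = {(1, "PER"), (7, "PER"), (12, "LOC"), (15, "ORG")}
response = {(1, "PER"), (7, "PER"), (12, "ORG"), (15, "ORG"), (20, "LOC")}

p, r, f = precision_recall_f(response, answer_key)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here three items agree, so Precision is 3/5, Recall is 3/4 and the F-Measure is their harmonic mean, 2/3.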
5. CONCLUSION
In this paper, we have discussed NER, some of the work that has already been accomplished in
NER using different approaches, Performance Metrics, and the results that we obtained by
performing NER in Hindi, Bengali and Telugu using the Hidden Markov Model. The F-Measure that we
obtained is 96% for Hindi (NLTK corpus), 98.5% for Bengali and 98.6% for Telugu (TABLE 6). The
results also suggest that as the amount of training data increases, the accuracy and the
F-Measure of the Hidden Markov Model also increase.
ACKNOWLEDGEMENT
We would like to thank all those who helped us in accomplishing this task.
REFERENCES
[1] Kamaldeep Kaur, Vishal Gupta. “Name Entity Recognition for Punjabi Language”. IRACST -
International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN:
2249-9555, Vol. 2, No. 3, June 2012.
[2] G. V. S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju, K. S. M. V. Kumar. “Named Entity
Recognition for Telugu Using Maximum Entropy Model”.
[3] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan. “A Survey on Named Entity
Recognition in Indian Languages with particular reference to Telugu”. IJCSI International Journal of
Computer Science Issues, Vol. 8, Issue 2, March 2011.
[4] Animesh Nayan, B. Ravi Kiran Rao, Pawandeep Singh, Sudip Sanyal and Ratna Sanyal. “Named Entity
Recognition for Indian Languages”. In Proceedings of the IJCNLP-08 Workshop on NER for South
and South East Asian Languages, Hyderabad (India), pp. 97–104, 2008. Available at:
http://www.aclweb.org/anthology-new/I/I08/I08-5014.pdf
[5] Sujan Kumar Saha, Sanjay Chatterji, Sandipan Dandapat. “A Hybrid Approach for Named Entity
Recognition in Indian Languages”.
[6] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay.
“Language Independent Named Entity Recognition in Indian Languages”. In Proceedings of the
IJCNLP-08 Workshop on NER for South and South East Asian Languages, pp. 33–40, Hyderabad,
India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[7] Vishal Gupta, Gurpreet Singh Lehal. “Named Entity Recognition for Punjabi Language Text
Summarization”. International Journal of Computer Applications (0975 – 8887), Vol. 33, No. 3, Nov.
2011.
[8] S. Biswas, M. K. Mishra, Sitanath_biswas, S. Acharya, S. Mohanty. “A Two Stage Language
Independent Named Entity Recognition for Indian Languages”. (IJCSIT) International Journal of
Computer Science and Information Technologies, Vol. 1 (4), 2010, 285-289.
[9] Darvinder Kaur, Vishal Gupta. “A survey of Named Entity Recognition in English and other Indian
Languages”. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November
2010.
[10] Shilpi Srivastava, Mukund Sanglikar & D. C. Kothari. “Named Entity Recognition System for Hindi
Language: A Hybrid Approach”. International Journal of Computational Linguistics (IJCL), Volume
(2) : Issue (1) : 2011. Available at:
http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
[11] Lawrence R. Rabiner. “A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition”. In Proceedings of the IEEE, 77 (2), pp. 257-286, February 1989. Available at:
http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
[12] Asif Ekbal and Sivaji Bandyopadhyay. “Named Entity Recognition using Support Vector Machine: A
Language Independent Approach”. International Journal of Electrical and Electronics Engineering 4:2,
2010. Available at: http://www.waset.org/journals/ijeee/v4/v4-2-19.pdf
[13] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis and Constantine D. Spyropoulos. “Learning
Decision Trees for Named-Entity Recognition and Classification”. Available at:
http://users.iit.demokritos.gr/~petasis/Publications/Papers/ECAI-2000.pdf
[14] Hideki Isozaki. “Japanese Named Entity Recognition based on a Simple Rule Generator and Decision
Tree Learning”. Available at: http://acl.ldc.upenn.edu/acl2001/MAIN/ISOZAKI.PDF
[15] Padmaja Sharma, Utpal Sharma and Jugal Kalita. “Named Entity Recognition: A Survey for the Indian
Languages”. Language in India: Strength for Today and Bright Hope for Tomorrow, Volume
11: 5, May 2011, ISSN 1930-2940. Available at:
http://www.languageinindia.com/may2011/v11i5may2011.pdf
[16] S. Pandian, K. A. Pavithra and T. Geetha. “Hybrid Three-stage Named Entity Recognizer for Tamil”.
INFOS2008, March, Cairo-Egypt. Available at:
http://infos2008.fci.cu.edu.eg/infos/NLP_08_P045-052.pdf
[17] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis and Constantine D. Spyropoulos. “Learning
Decision Trees for Named-Entity Recognition and Classification”.
Available at: http://users.iit.demokritos.gr/~petasis/Publications/Papers/ECAI-2000.pdf
[18] Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra. “Gazetteer Preparation for Named Entity
Recognition in Indian Languages”. Available at:
http://www.aclweb.org/anthology-new/I/I08/I08-7002.pdf
[19] James Mayfield, Paul McNamee and Christine Piatko. “Named Entity Recognition using Hundreds
of Thousands of Features”. Available at: http://acl.ldc.upenn.edu/W/W03/W03-0429.pdf
[20] Praveen Kumar P and Ravi Kiran V. “A Hybrid Named Entity Recognition System for South Asian
Languages”. Available at: http://www.aclweb.org/anthology-new/I/I08/I08-5012.pdf
AUTHORS
Sudha Morwal is an active researcher in the field of Natural Language Processing.
She is currently working as an Associate Professor in the Department of Computer Science at
Banasthali University (Rajasthan), India. She has completed an M.Tech (Computer Science),
NET and M.Sc (Computer Science), and her PhD is in progress at Banasthali University
(Rajasthan), India. She has published many papers in International Conferences and
Journals.
Deepti Chopra received her B.Tech degree in Computer Science and Engineering from
Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011. She is currently
pursuing her M.Tech degree in Computer Science and Engineering at Banasthali
University, Rajasthan. Her research interests include Artificial Intelligence, Natural
Language Processing and Information Retrieval. She has published many papers in
International journals and conferences.
More Related Content

PDF
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
PDF
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
PDF
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
PDF
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
PDF
Quality estimation of machine translation outputs through stemming
PDF
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
PDF
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
PDF
Named Entity Recognition for Telugu Using Conditional Random Field
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
Quality estimation of machine translation outputs through stemming
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
Named Entity Recognition for Telugu Using Conditional Random Field

Similar to Identification and Classification of Named Entities in Indian Languages (20)

PDF
Survey on Indian CLIR and MT systems in Marathi Language
PDF
Handling ambiguities and unknown words in named entity recognition using anap...
PDF
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
PDF
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
PDF
FIRE2014_IIT-P
PDF
Named Entity Recognition using Hidden Markov Model (HMM)
PDF
Named Entity Recognition using Hidden Markov Model (HMM)
PDF
Named Entity Recognition using Hidden Markov Model (HMM)
PDF
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
PDF
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
PDF
Named Entity Recognition System for Hindi Language: A Hybrid Approach
PDF
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
PDF
HMM BASED POS TAGGER FOR HINDI
PDF
A survey of named entity recognition in assamese and other indian languages
PDF
D3 dhanalakshmi
PDF
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
PDF
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
PDF
A Review on a web based Punjabi t o English Machine Transliteration System
PDF
A Review on a web based Punjabi to English Machine Transliteration System
PDF
Towards Building Semantic Role Labeler for Indian Languages
Survey on Indian CLIR and MT systems in Marathi Language
Handling ambiguities and unknown words in named entity recognition using anap...
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
FIRE2014_IIT-P
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
Named Entity Recognition System for Hindi Language: A Hybrid Approach
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
HMM BASED POS TAGGER FOR HINDI
A survey of named entity recognition in assamese and other indian languages
D3 dhanalakshmi
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
A Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi to English Machine Transliteration System
Towards Building Semantic Role Labeler for Indian Languages

More from kevig (20)

PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
Call For Papers - 3rd International Conference on NLP & Signal Processing (NL...
PDF
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMM...
PDF
Call For Papers- 14th International Conference on Natural Language Processing...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
Call For Papers - 6th International Conference on Natural Language Processing...
PDF
July 2025 Top 10 Download Article in Natural Language Computing.pdf
PDF
Orchestrating Multi-Agent Systems for Multi-Source Information Retrieval and ...
PDF
Call For Papers - 6th International Conference On NLP Trends & Technologies (...
PDF
Call For Papers - 6th International Conference on Natural Language Computing ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)...
PDF
Call For Papers - 4th International Conference on NLP and Machine Learning Tr...
PDF
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
UNIQUE APPROACH TO CONTROL SPEECH, SENSORY AND MOTOR NEURONAL DISORDER THROUG...
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
Call For Papers - 3rd International Conference on NLP & Signal Processing (NL...
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMM...
Call For Papers- 14th International Conference on Natural Language Processing...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
Call For Papers - 6th International Conference on Natural Language Processing...
July 2025 Top 10 Download Article in Natural Language Computing.pdf
Orchestrating Multi-Agent Systems for Multi-Source Information Retrieval and ...
Call For Papers - 6th International Conference On NLP Trends & Technologies (...
Call For Papers - 6th International Conference on Natural Language Computing ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)...
Call For Papers - 4th International Conference on NLP and Machine Learning Tr...
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Call For Papers - International Journal on Natural Language Computing (IJNLC)
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
UNIQUE APPROACH TO CONTROL SPEECH, SENSORY AND MOTOR NEURONAL DISORDER THROUG...

Recently uploaded (20)

PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Machine learning based COVID-19 study performance prediction
PDF
Encapsulation theory and applications.pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
cuic standard and advanced reporting.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Big Data Technologies - Introduction.pptx
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
Machine learning based COVID-19 study performance prediction
Encapsulation theory and applications.pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Per capita expenditure prediction using model stacking based on satellite ima...
Unlocking AI with Model Context Protocol (MCP)
Reach Out and Touch Someone: Haptics and Empathic Computing
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
The Rise and Fall of 3GPP – Time for a Sabbatical?
MYSQL Presentation for SQL database connectivity
Mobile App Security Testing_ A Comprehensive Guide.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
cuic standard and advanced reporting.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Network Security Unit 5.pdf for BCA BBA.
sap open course for s4hana steps from ECC to s4
Big Data Technologies - Introduction.pptx
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows

Identification and Classification of Named Entities in Indian Languages

  • 1. International Journal on Natural Language Computing (IJNLC) Vol. 2, No.1, February 2013 DOI : 10.5121/ijnlc.2013.2105 37 IDENTIFICATION AND CLASSIFICATION OF NAMED ENTITIES IN INDIAN LANGUAGES Sudha Morwal and Deepti Chopra Department of Computer Science, Banasthali Vidyapith, Jaipur (Raj.), INDIA sudha_morwal@yahoo.co.in deeptichopra11@yahoo.co.in ABSTRACT The process of identification of Named Entities (NEs) in a given document and then there classification into different categories of NEs is referred to as Named Entity Recognition (NER). We need to do a great effort in order to perform NER in Indian languages and achieve the same or higher accuracy as that obtained by English and the European languages. In this paper, we have presented the results that we have achieved by performing NER in Hindi, Bengali and Telugu using Hidden Markov Model (HMM) and Performance Metrics. KEYWORDS Accuracy, HMM, Named Entities, NER, Performance Metrics 1. INTRODUCTION Named Entities (NEs) are the proper nouns or the entities that represent the Name of Person, Location, Organisation, River, Percentage, Quantity and Time etc. NER is used extensively in Natural Language Processing. NER is the task to categorize all the NEs or the proper nouns in a document into different NEs classes’ e. g Person, Location, Organisation, City, State, and Country etc [1][2]. Consider an annotated sentence in Hindi: “ूधानमंऽी/OTHER अटल बहारवाजपेयी/PER ने/OTHER कहा/OTHER है/OTHER क/OTHER वमंऽी/OTHER यशवंतिसहा/PER ारा/OTHER ूःतुत/OTHER बजट/OTHER वकास/OTHER क/OTHER गित/OTHER को/OTHER तेज/OTHER करेगा/OTHER ।/OTHER” In the above sentence, the NER based system identifies the NEs and classify them into different NEs classes. Here, अटल बहारवाजपेयी and यशवंतिसहा are the NEs and are Name of Persons so these are allotted Name of Person tag (PER).The rest of the tokens are given OTHER tag since these are not Named Entities. 
Various applications of NER include – Information Retrieval, Information Extraction, Text Summarization, Machine Translation, question answering system etc. Although, a lot of work has been accomplished in NER in English, Chinese and Spanish etc. But, no significant amount of work has been accomplished in NER in Indian languages (IL). This is due to many reasons. Firstly, IL is free word order, inflectional as well as morphologically rich in nature. Secondly, Unlike in English, there is no concept of Capitalisation in IL, in which capital letter is used to
  • 2. International Journal on Natural Language Computing (IJNLC) Vol. 2, No.1, February 2013 38 detect NEs in a document. Thirdly, web lack in the resources in the Indian languages. Fourthly, IL contains NEs which also lie in dictionary as common nouns. So, we need to resolve ambiguities that arise in IL. In this paper, we have discussed about NER based system for IL particularly for Hindi, Telugu and Bengali using Hidden Markov Model (HMM). 2. LITERATURE REVIEW Based on similar sounding property, a phonetic matching technique is developed to perform NER in English and Hindi. Precision and Average recall obtained for English are-81.40% and 81.3%. and for Hindi are-80.2% and 54.97% [4]. Annotated Corpora of Bengali and Hindi having 122,467 tokens and 502,974 tokens are used in performing NER using Support Vector Machine. The testing is done on 35K and 60K tokens of Bengali and Hindi. Average recall, precision and f- score values for Bengali are: 88.61%, 80.12% and 84.15%, whereas for Hindi, it is: 80.23%, 74.34% and 77.17% [12]. NER is performed in Telugu in two phases. In the first phase, Telugu dictionaries, Noun Morphological Stemmer and Noun Suffixes are used to identify the Nouns. In the second phase, transliterated gazetteer lists, various Named Entity suffix features, context features and morphological features are used to identify the Named Entities. The accuracy achieved using this approach is up to 95.37% [3]. Maximum Entropy (ME) approach and Contextual information along with different orthographic word level features have been used for performing NER in Telugu. Contextual word features refer to the preceding and the following words of a given word. The Corpus of Telugu has been taken from Telugu Wikipedia, Telugu Local dailies; iinaa Du. The Performance Metrics i.e. Precision, Recall and F-Measure obtained is 75.89%, 53.35% and 61.62% [2]. It has been shown that the performance of the HMM model of a hybrid system is better than the CRF model. 
The F-Measure values obtained using HMM Model for languages such as Bengali, Hindi, Oriya, Telugu and Urdu are: 39.77, 46.84, 45.84, 46.58 and 44.73 [14]. 3. IMPLEMENTATION AND RESULTS We have developed a NER based system based on the Hidden Markov Model (HMM) approach Fig1. The input to this NER based system is the raw text and its output is the Named Entity tags. This NER based system performs in three phases. The first phase is referred to as ‘Annotation phase’ that assists in producing tagged or annotated text from the raw text. The second phase is referred to as ‘Train HMM’. In this phase, it computes the three most essential parameters of HMM i.e. Start Probability, Emission Probability (B) and the Transition Probability (A) [11][12][16]. The last phase is referred to as ‘TEST HMM’. In this phase, user gives certain test sentences to the system, and based on the HMM parameters computed in the previous state, Viterbi algorithm computes the optimal state sequence for the given test sentence. Figure 1 NER in Indian languages using HMM
  • 3. International Journal on Natural Language Computing (IJNLC) Vol. 2, No.1, February 2013 39 Mathematically, HMM parameters are given as follows: A = aij = (Number of transitions from state si to sj) / (Number of transitions from state si). B = bj (k) = (Number of times in state j and observing symbol k) / (expected number of times in state j). We have performed NER in Hindi, Bengali and Telugu. The details are mentioned in TABLE 1. For Hindi, we considered Tourism Domain Corpus which was developed at Banasthali Vidyapith. Also, we took Hindi, Bengali and Telugu Corpus from NLTK Indian Corpora. The Hindi text that we took from NLTK was related to Politics and Sports. Bengali and Telugu texts from NLTK were of general domain. TABLE 1 Details of NER using HMM We have also done training on 8,623 words or 540 sentences in Hindi obtained from NLTK Indian Corpora. We considered 8 tags TABLE 2. Accuracy, Precision, Recall and F-Measure reported is 96% TABLE 2 Tags used in NER in Hindi sentences from NLTK Indian Corpora SNO TAGS 1 PER (Name of Person) 2 LOC (Name of Location) 3 OTHER (Not a Named Entity) 4 CO (Name of Country) 5 MONTH 6 ORG (Name of Organization) 7 WEEK 8 PARTY (Name of Political Party) S.N O. LANGUAGE SOURCE DEVELOPED BY WEBSITE DOMAIN 1 HINDI Banasthali Vidyapith Tourism 2 HINDI NLTK Indian Corpus author: A Kumaran http://nltk.googlecode.com /svn/trunk/nltk_data/index .xml Related to Politics and Sports News 3 BENGALI NLTK Indian Corpus author: A Kumaran http://nltk.googlecode.com /svn/trunk/nltk_data/index .xml General domain and Related to Countries, Locations, Animals, languages etc. 4 TELUGU NLTK Indian Corpus author: A Kumaran http://nltk.googlecode.com /svn/trunk/nltk_data/index .xml General
We performed NER in Hindi. For training, we took 100 sentences (2,332 tokens) from a Hindi tourism corpus developed at Banasthali Vidyapith and annotated them using the 10 tags listed in Table 3. We obtained an F-Measure of 93%.

TABLE 3. Tags used in NER in Hindi sentences from Tourism domain
SNO | TAGS
1 | PER (Name of Person)
2 | LOC (Name of Location)
3 | OTHER (Not a Named Entity)
4 | SPORT
5 | TIME
6 | MONTH
7 | ORG (Name of Organization)
8 | VEH (Name of Vehicle)
9 | QTY (Name of Quantity)
10 | RIVER

We also trained on 994 Telugu sentences (9,996 words) from the NLTK Indian Corpora. The F-Measure obtained is 98.6%. Table 4 shows the tags that we used.

TABLE 4. Tags used in NER in Telugu sentences from NLTK Indian Corpora
SNO | TAGS
1 | PER (Name of Person)
2 | LOC (Name of Location)
3 | OTHER (Not a Named Entity)
4 | CO (Name of Country)
5 | SUBJECT (Name of Subject)
6 | LANGUAGE
Next, we trained on 899 Bengali sentences (10,303 words) obtained from the NLTK Indian Corpora. The F-Measure obtained is 98.5%. The tags used are shown in Table 5.

TABLE 5. Tags used in NER in Bengali sentences from NLTK Indian Corpora
SNO | TAGS
1 | PER (Name of Person)
2 | LOC (Name of Location)
3 | OTHER (Not a Named Entity)
4 | CO (Name of Country)
5 | ANIMAL (Name of Animal)
6 | RIGHT (Name of Fundamental Right)
7 | DAIRY (Name of Dairy)
8 | SONG
9 | SPICES (Name of Spices)
10 | MONTH
11 | LANGUAGE
12 | PANCHAYAT (Name of Panchayat)

The results we have obtained are summarised in Table 6.

TABLE 6. Results obtained by NER using HMM
SNO | LANGUAGE | PRECISION | RECALL | F-MEASURE
1 | Hindi (Tourism Corpus) | 93% | 93% | 93%
  | Hindi (General Domain) | 96% | 96% | 96%
2 | Bengali | 98.5% | 98.5% | 98.5%
3 | Telugu | 98.6% | 98.6% | 98.6%

4. PERFORMANCE METRICS

Performance metrics are measures used to estimate the performance of an NER system. They can be calculated in terms of three parameters: Precision, Recall and F-Measure [10][6]. Consider the following terms:

Response (R): the output of the NER system.
Answer Key (A): the interpretation of a human annotator.
Response-Answer Key (RA): the items that appear both in the output of the NER system and in the human interpretation.

In the ratios below, R, A and RA stand for the number of items in each of these sets. Hence, we define Precision, Recall and F-Measure as follows [2][3][9]:

Precision (P): RA/R
Recall (R): RA/A
F-Measure: (2 * P * R) / (P + R), where P is Precision and R is Recall.

5. CONCLUSION

In this paper, we have discussed NER, some of the work that has already been accomplished in NER using different approaches, Performance Metrics, and the results that we have obtained by performing NER in Hindi, Bengali and Telugu using the Hidden Markov Model. As reported in Table 6, the F-Measure that we have obtained is 96% for Hindi (93% on the Tourism corpus), 98.5% for Bengali and 98.6% for Telugu. The results also suggest that as the amount of training data for the Hidden Markov Model increases, the accuracy and the F-Measure increase as well.

ACKNOWLEDGEMENT

We would like to thank all those who helped us in accomplishing this task.

REFERENCES

[1] Kamaldeep Kaur, Vishal Gupta, “Name Entity Recognition for Punjabi Language”, IRACST - International Journal of Computer Science and Information Technology Security (IJCSITS), ISSN: 2249-9555, Vol. 2, No. 3, June 2012.
[2] G. V. S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju, K. S. M. V. Kumar, “Named Entity Recognition for Telugu Using Maximum Entropy Model”.
[3] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan, “A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu”, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[4] Animesh Nayan, B. Ravi Kiran Rao, Pawandeep Singh, Sudip Sanyal and Ratna Sanyal, “Named Entity Recognition for Indian Languages”, in Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad (India), pp. 97-104, 2008. Available at: http://www.aclweb.org/anthology-new/I/I08/I08-5014.pdf
[5] Sujan Kumar Saha, Sanjay Chatterji, Sandipan Dandapat.
“A Hybrid Approach for Named Entity Recognition in Indian Languages”.
[6] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay, “Language Independent Named Entity Recognition in Indian Languages”, in Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pp. 33-40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[7] Vishal Gupta, Gurpreet Singh Lehal, “Named Entity Recognition for Punjabi Language Text Summarization”, International Journal of Computer Applications (0975-8887), Vol. 33, No. 3, Nov. 2011.
[8] S. Biswas, M. K. Mishra, Sitanath_biswas, S. Acharya, S. Mohanty, “A Two Stage Language Independent Named Entity Recognition for Indian Languages”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 1 (4), 2010, pp. 285-289.
[9] Darvinder Kaur, Vishal Gupta, “A survey of Named Entity Recognition in English and other Indian Languages”, IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010.
[10] Shilpi Srivastava, Mukund Sanglikar, D. C. Kothari, “Named Entity Recognition System for Hindi Language: A Hybrid Approach”, International Journal of Computational Linguistics (IJCL), Volume (2) : Issue (1) : 2011. Available at: http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
[11] Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, in Proceedings of the IEEE, 77 (2), pp. 257-286, February 1989. Available at: http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
[12] Asif Ekbal and Sivaji Bandyopadhyay, “Named Entity Recognition using Support Vector Machine: A Language Independent Approach”, International Journal of Electrical and Electronics Engineering, 4:2, 2010. Available at: http://www.waset.org/journals/ijeee/v4/v4-2-19.pdf
[13] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis and Constantine D. Spyropoulos, “Learning Decision Trees for Named-Entity Recognition and Classification”. Available at: http://users.iit.demokritos.gr/~petasis/Publications/Papers/ECAI-2000.pdf
[14] Hideki Isozaki, “Japanese Named Entity Recognition based on a Simple Rule Generator and Decision Tree Learning”. Available at: http://acl.ldc.upenn.edu/acl2001/MAIN/ISOZAKI.PDF
[15] Padmaja Sharma, Utpal Sharma and Jugal Kalita, “Named Entity Recognition: A Survey for the Indian Languages”, Language in India: Strength for Today and Bright Hope for Tomorrow, Volume 11: 5, May 2011, ISSN 1930-2940. Available at: http://www.languageinindia.com/may2011/v11i5may2011.pdf
[16] S. Pandian, K. A. Pavithra and T. Geetha, “Hybrid Three-stage Named Entity Recognizer for Tamil”, INFOS2008, March, Cairo-Egypt. Available at: http://infos2008.fci.cu.edu.eg/infos/NLP_08_P045-052.pdf
[17] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis and Constantine D. Spyropoulos, “Learning Decision Trees for Named-Entity Recognition and Classification”. Available at: http://users.iit.demokritos.gr/~petasis/Publications/Papers/ECAI-2000.pdf
[18] Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra, “Gazetteer Preparation for Named Entity Recognition in Indian Languages”. Available at: http://www.aclweb.org/anthology-new/I/I08/I08-7002.pdf
[19] James Mayfield, Paul McNamee and Christine Piatko, “Named Entity Recognition using Hundreds of Thousands of Features”.
Available at: http://acl.ldc.upenn.edu/W/W03/W03-0429.pdf
[20] Praveen Kumar P and Ravi Kiran V, “A Hybrid Named Entity Recognition System for South Asian Languages”. Available at: http://www.aclweb.org/anthology-new/I/I08/I08-5012.pdf

AUTHORS

Sudha Morwal is an active researcher in the field of Natural Language Processing. She is currently working as an Associate Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She has completed M.Tech (Computer Science), NET and M.Sc (Computer Science), and her PhD is in progress at Banasthali University (Rajasthan), India. She has published many papers in international conferences and journals.

Deepti Chopra received her B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan, in 2011. She is currently pursuing her M.Tech degree in Computer Science and Engineering at Banasthali University, Rajasthan. Her research interests include Artificial Intelligence, Natural Language Processing and Information Retrieval. She has published many papers in international journals and conferences.