Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014
DOI:10.5121/acij.2014.5403 17
A DECISION TREE BASED WORD SENSE
DISAMBIGUATION SYSTEM IN MANIPURI
LANGUAGE
Richard Laishram Singh1
, Krishnendu Ghosh1
, Kishorjit Nongmeikapam2
and
Sivaji Bandyopadhyay3
1
School of Computer Engineering, Kalinga Institute of Industrial Technology,
Bhubaneswar, India
2
Department of Computer Science and Engineering, Manipur Institute of Technology
Manipur, India
3
Dept of Computer Science and Engineering, Jadavpur University, Kolkata, India
ABSTRACT
This paper manifests a primary attempt on building a word sense disambiguation system in Manipuri
language. The paper discusses related attempts made in the Manipuri language followed by the proposed
plan. A database, consisting of 650 sentences, is collected in Manipuri language in the course of the study.
Conventional positional and context based features are suggested to capture the sense of the words, which
have ambiguous and multiple senses. The proposed work is expected to predict the senses of the
polysemous words with high accuracy with the help of the suitable knowledge acquisition techniques. The
system produces an accuracy of 71.75 %.
KEYWORDS
Word sense disambiguation(WSD); Classification and regression tree (CART); polysemous word;
Manipuri.
1. INTRODUCTION
Word sense disambiguation (WSD) is the technique to disambiguate between the senses of a
word [1]. It has been recognized as one of the major problems in the field of natural language
processing [2]. Word sense disambiguation is not an isolated system by itself, but can be fitted
with other important tasks such as, information retrieval, machine translation, speech processing,
parts-of-speech tagging and text processing [3, 4]. It has been noted that, the sense of a word is
determined based on its syntactic, positional, contextual and other relevant factors. In different
scenario, the sense or the meaning of a single word can be different. Such words are known as
polysemous words. The current study suggests development of an automatic WSD system [5]
with the help of a knowledge acquisition technique.
The goal of the present work is to develop a system which can disambiguate between the different
senses of a polysemous word in Manipuri language. In European languages, efficient and
automatic WSD systems are present. In Indian languages, such systems are mostly rule-based due
to lack of standardized database and presence of the proper knowledge acquisition tools. So far,
negligible amount of work has been reported in Manipuri language. That motivates the present
attempt to collect a suitable database consisting of every possible contexts and senses, used in
day-to-day life.
Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014
18
The current study reports mainly the development issues of a WSD system in Manipuri language.
Manipuri is basically a Tibeto-Burman language, spoken mainly in the valley of Manipur, a
North-Eastern state of India. Hence, the syntactic and semantic structures of the language are
different from other Indian languages. The task of disambiguating the senses in Manipuri
language poses a challenge due to its agglutinative, reduplicative, compounding and tonal nature
[6]. The language uses the traditional Meitei-Mayek script for general use. But, that script is not
used in the current study due of presence of limited words and structures. The present work uses
Bengali script for developing the database. The lack of a standard font made the task even more
challenging. Besides, limited amount of works are reported in Manipuri language in the field of
natural language processing such as, POS tagging [7], Transliteration [8] and negligible amount
in WSD.
The paper is organized as follows: related literatures are discussed in section 2. The details of the
corpus are mentioned in section 3 The proposed architecture of the word sense disambiguation
module is discussed in the section 4 followed by the conclusion and direction towards future work
in section 5. Finally a few important references are added.
2. LITERATURE SURVEY
Several works on WSD are reported in English and other European languages. On the
contrary, limited works have been noted for Indian languages. The approaches are classified in
five groups: (i) selection restriction based disambiguation, (ii) machine learning based approach,
(iii) supervised learning approach, (iv) bootstrapping approach and (v) dictionary based
disambiguation.
2.1 Selection restriction based disambiguation
Selection restriction and type hierarchies are the primary knowledge sources used to disambiguate
the senses of the words in most of the integrated approaches. In an integrated rule based approach
for semantic analysis [9], selection restrictions block the formation of component meaning
representations containing the selection restriction violations [1]. For Indian languages, where the
semantic and syntactic structures are not analyzed properly and the required knowledge sources
are missing, rule based processes generally are not suggested.
2.2 Machine learning based approach
The template is used to format your paper and style the text. All margins, column widths, line
spaces, and text fonts are prescribed; please do not alter them. You may note peculiarities. For
example, the head margin in this template measures proportionately more than is customary. This
measurement and others are deliberate, using specifications that anticipate your paper as one part
of the entire proceedings, and not as an independent document. Please do not revise any of the
current designations.
2.3 Supervised learning approach
Supervised learning is a specific category of the machine learning approaches to train the
classifier from supervised (labelled) training data [1]. The classifier is fed with the training data
consisting of the exemplars along with the output labels. The classifier predicts the correct output
value for any valid test input object.
Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014
19
2.4 Bootstrapping approach
Bootstrapping approach is an incrementive machine learning based approach, widely used for
developing word sense disambiguation systems. Bootstrapping approach eliminates need for a
large training set by relying on a relatively less number of instances called seeds. Seeds are used
for training and testing. The generated pair of test patterns and the predicted output is later added
with the seeds to populate the training data. This process is very useful, if the base database is
comparatively small in size.
2.5 Bootstrapping approach
Dictionary based disambiguation approaches use a dictionary which provides both the means of
constructing a sense tagger and the target sense [1, 10]. In order to perform a large scale
disambiguation, machine readable dictionaries (MRD) are used to automate the task. This process
is useful for the languages having a digitalized machine readable dictionary. Due to the lack of
standard database and related knowledge acquisition tools, such approaches can hardly be used
for the focus Manipuri or other Indian languages.
3. DATABASE COLLECTION
The database used in the current study contains a total of 672 Manipuri sentences. These
sentences are collected from a local newspaper "The Sangai Express". The sentences contain a
total of 13,167 words and around 2,000 polysemous words. Polysemous words are the focus of
the current study as they contain more than one sense based on contexts and other factors. The
sentences are chosen from the news domains to accommodate day-to-day words and senses
mostly. The entire database, due to lack of digital resources and standard font are typed manually
[11]. The database is thoroughly checked for inconsistencies present in spelling, spacing and,
punctuation. The polysemous words are then tagged with the sense manually.
4. PROPOSED ARCHITECTURE
The current study attempts to find a solution to disambiguate between the possible senses of
Manipuri words. A base database is collected with sentences consisting of day-to-day senses. The
database is then preprocessed and suggested features are collected out of it. Then, a classifier is
developed with the knowledge obtained from the features using a learning algorithm. The
classifier finally predicts the sense of the test sentences.
The word sense disambiguation system works in two phases: (i) training phase and (ii) testing
phase. In the training phase, the training data is first preprocessed and features are generated. The
features are used then to train the classifier based on any learning algorithm. On the contrary, in
the testing phase, the test data is preprocessed and features are generated. The features of the test
data are fed to the classifier and the predicted output is compared for performance evaluation of
the system. Hence, the proposed architecture of the word sense disambiguation module for
Manipuri language contains five building blocks: (i) preprocessing, (ii) feature selection and
generation and (iii) training, (iv) testing and (v) performance evaluation.
A. Preprocessing
Preprocessing is the module where the raw data is processed so that the features can be generated
from the training or test data efficiently. Data is converted to corresponding ASCII format after
Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014
20
the inconsistencies (like spelling, punctuations and spacing mistakes) are removed. When the data
is converted to ASCII format, the selected features are generated semi-automatically.
B. Feature Selection and Generation
In this phase, the ASCII format generated database is used to build the feature. The current
study, as it is in the preliminary attempt, uses a set of very common, widely popular and easily
extractable features. A total of 6 features are taken to build feature:
(i) the focus word for which the sense is to be derived,
(ii) the normalized position of the word in the sentence,
(iii) the previous word,
(iv) the previous-to-previous word,
(v) the next word,
(vi) the next to next word.
A 5-gram window is formed using the pair of the focus word and its context words which forms
the context information. A focus word, based on the context may have different senses. Hence, in
order to disambiguate the sense of the focused word, the contextual information is very much
necessary [13], and helps in predicting the correct one. In the current study positional feature is
suggested because of the lack of other relevant morphological features. As the syntactic and
semantic structures of a sentence remain mostly similar for a particular language, this feature
contains probable morphological information. To generate the final input feature vector, from the
database mentioned above mentioned six features are collected automatically.
C. Training
In the present work, a supervised learning algorithm is proposed based on decision tree algorithm.
In supervised approaches, the classifier is developed with the help of the input feature vector and
the output labels of the data. Hence, the final feature vector is developed using the six features
mentioned in section B and the output sense of the focus word. The sense of the focus word is
derived manually and finally the seven entries are fed to the classifier. The classifier is trained
with the decision tree algorithm.
The decision tree is a simple and linear decision making approach. In the current work,
classification and regression tree (CART) based algorithm is suggested to train the classifier.
Binary decision tree is developed automatically using the seven entries of the focus word present
in training data. CART uses all the instances of the training data and asks binary questions on the
features. CART algorithm selects the most predictive feature from the feature set and the best
possible question to achieve accurate classification of the training. Based on the feature and its
distribution of values, the training data is segmented into two parts. This segmentation is carried
out based on each of the features and finally at the leaves, a output class is achieved. CART has
been reported as one of the suitable learning algorithms for developing supervised learning
models for Indian languages like Manipuri whose characteristics have not studied in detail. An
example of a CART is given in Fig 1.
Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014
21
Fig 1: Flow Chart of CART based word sense disambiguation
D. Testing
During the testing, CART offers prediction by comparing the features for the test case with the
trained decision tree. While predicting the sense for a test word, the corresponding features are
generated and compared with the trained decision tree. Starting from the root node asking a
question on individual features, the features of the test word will lead to a leaf which offers an
approximated output.
The features are already generated in the feature generation phase. The trained CART is fed with
features are test data and corresponding approximations are noted. The predictions are later
compared with the correct sense tags to perform evaluation of the current system.
E. Performance Evaluation
The accuracy of the proposed word sense disambiguation system is evaluated using an objective
analysis. The analysis is carried out by determining the correct sense prediction percentage for the
test words.
Being trained on 1,600 words and tested on 400 words using 7 features, the word sense
disambiguation system predicts the senses with 71.75% accuracy. The performance of the
proposed model is discussed in TABLE 1.
TABLE I: Objective Evaluation of CART based word sense disambiguation model
5. CONCLUSION AND FUTURE WORK
This paper proposed a preliminary attempt to predict the senses of the words for Manipuri
language. A set of positional and contextual features are suggested for developing the word sense
disambiguation system. Due to unavailability of morphological and other relevant linguistic
Predicted
Sense
Correctly
Predicted Sense
Accuracy
400 287 71.75%
Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014
22
knowledge, the accuracy of the model is lower than the models built for languages like English.
Moreover, the size of training data or the features used are not sufficient to achieve the
classification quality of human transcribers. A relevant and appropriate feature set can achieve
higher accuracy. Further improvements are expected by:
1. Joining a rule-based model with the CART based data-driven model to improve accuracy of the
word sense disambiguation system.
2. Exploring other well-known classifiers such as ANN, SVM or other probabilistic methods like
HMM and N-gram models.
ACKNOWLEDGEMENTS
The work presented in this paper is performed at KIIT University as a project for the fullfilment
of the degree of Masters of Technology. Special thanks to Mr. Kishorjit N. Singh for his help in
data collection.
REFERENCES
[1] D.Jurafsky, J.H.Martin, “Speech and Language Processing,”, Pearson Publishers, December, pp. 658-
672
[2] R.Mihalcea, D. Moldovan, “Word Sense Disambiguation based on Semantic Density”, in proceedings
of Seventh International conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010, pp.
1492-1496.
[3] M.Sinha, M.K.Reddy, P.Bhattacharyya, P.Pandey and L.Kashyap, “Word Sense Disambiguation,” in
proceedings of International Journal of Computer Applications (IJCA),2010, vol. 5.9, pp. 25-32.
[4] R.Mihalcea and D. Moldovan, “A Method for Disambiguating Word Senses in a Large Corpus,” in
proceedings of COLING/ACL Workshop on Usage of WordNet in Natural Language Processing,
Computers and the Humanities, 1992, volume 26, no. 5-6,pp. 415-439.
[5] C.Leacock, G.Towell and E.Voorhees, “Corpus-Based Statistical Sense Resolution,” ARPA
Workshop on Human Language Technology, 1993.
[6] T.D.Singh and S.Bandyopadhyay, “Word Word Class and Sentence Type Identification in Manipuri
Morphological Analyzer,” in proceedings of MSPIL, 2006.
[7 ]K.Nongmeikapam, L.Nonglenjaoba, Y.Nirmal and S.Bandhyopadhyay, “Improvement of CRF based
Manipuri POS tagger by using Reduplicated MWE (RMWE),” arXiv preprint arXiv: 1111.2399,
2011.
[8] K.Nongmeikapam and S. Bandhyopadhyay, “A Transliteration of CRF Based Manipuri POS
Tagging,” in proceedings of Procedia Technology, Elsevier, 2012, pp. 582-589.
[9] M S. Olsen, “WordNet Word sense Disambiguation using an Automatically Generated Ontology,”
Class of 2003 Senior Conference on Natural Language Processing, 2003.
[10] J.Veronis and M. Nancy, “Word Sense Disambiguation with Very Large Neural Networks Extracted
from Machine Readable Dictionarie,” in Proceedings of 13th conference on Computational
linguistics, 1990, volume 2, pp. 389-394.
[11] M.M. Khapra, S. Shah, P. Kedia and P. Bhattacharyya, “Projecting Parameters for Multilingual Word
Sense Disambiguationin Proceedings of 2009 Conference on Empirical Methods in Natural Language
Processing, 2009,volume 1, pp. 459-467.
[12] K.Ghosh, R. V. Reddy, N. P. Narendra, S. Maity, S. Koolagudi and K. S. Rao, “Grapheme-to-
phoneme Conversion in Bengali for Festival based TTS Framework”, in Proceedings of International
Conference of Natural Language Processing, 2010, pp. 294.
[13] A.Eneko and G.Rigau, “Word Sense Disambiguation using Conceptual Density”, in Proceedings of
the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark,
1996.

More Related Content

PDF
Cl35491494
PDF
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
PDF
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
PDF
G1803013542
PDF
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
PDF
A survey of named entity recognition in assamese and other indian languages
PDF
A Review on Marathi Language Speech Database Development for Automatic Speech...
PDF
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
Cl35491494
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
G1803013542
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
A survey of named entity recognition in assamese and other indian languages
A Review on Marathi Language Speech Database Development for Automatic Speech...
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES

What's hot (20)

PDF
Cf32516518
PDF
A Review on Speech Corpus Development for Automatic Speech Recognition in Ind...
PDF
Survey on Indian CLIR and MT systems in Marathi Language
PDF
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
PDF
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
PDF
Implementation of Marathi Language Speech Databases for Large Dictionary
PDF
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
PDF
J1803015357
PDF
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PDF
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
PDF
Cross language information retrieval in indian
PDF
An Improved Approach for Word Ambiguity Removal
PDF
Marathi Text-To-Speech Synthesis using Natural Language Processing
PDF
A language independent approach to develop urduir system
PDF
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
PDF
Verb based manipuri sentiment analysis
PDF
Design of A Spell Corrector For Hausa Language
PDF
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
PDF
SCTUR: A Sentiment Classification Technique for URDU
PDF
Current state of the art pos tagging for indian languages – a study
Cf32516518
A Review on Speech Corpus Development for Automatic Speech Recognition in Ind...
Survey on Indian CLIR and MT systems in Marathi Language
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
Implementation of Marathi Language Speech Databases for Large Dictionary
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
J1803015357
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
Cross language information retrieval in indian
An Improved Approach for Word Ambiguity Removal
Marathi Text-To-Speech Synthesis using Natural Language Processing
A language independent approach to develop urduir system
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
Verb based manipuri sentiment analysis
Design of A Spell Corrector For Hausa Language
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
SCTUR: A Sentiment Classification Technique for URDU
Current state of the art pos tagging for indian languages – a study
Ad

Viewers also liked (7)

PPTX
Jose P Training 12.17.08
PDF
Compound Noun Polysemy and Sense Enumeration in WordNet
PPT
Ppt presentation group 6 1 28 11
PPTX
Synonyms, antonyms, homophones, homographs powerpoint
PPTX
Polysemous and Homonymous expressions
PPTX
DOC
SYNONYMS, ANTONYMS, POLYSEMY, HOMONYM, AND HOMOGRAPH
Jose P Training 12.17.08
Compound Noun Polysemy and Sense Enumeration in WordNet
Ppt presentation group 6 1 28 11
Synonyms, antonyms, homophones, homographs powerpoint
Polysemous and Homonymous expressions
SYNONYMS, ANTONYMS, POLYSEMY, HOMONYM, AND HOMOGRAPH
Ad

Similar to A decision tree based word sense disambiguation system in manipuri language (20)

PDF
International Journal of Engineering and Science Invention (IJESI)
PDF
A Survey on Word Sense Disambiguation
DOC
Doc format.
PDF
Word sense disambiguation a survey
PDF
A supervised word sense disambiguation method using ontology and context know...
PDF
Wsd final paper
PDF
Malayalam Word Sense Disambiguation using Machine Learning Approach
PPT
Advances in word sense disambiguation
PDF
Automatic classification of bengali sentences based on sense definitions pres...
PDF
AMBIGUITY-AWARE DOCUMENT SIMILARITY
PDF
IJNLC 2013 - Ambiguity-Aware Document Similarity
PDF
MODIFIED PAGE RANK ALGORITHM TO SOLVE AMBIGUITY OF POLYSEMOUS WORDS
PDF
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH
PDF
A performance of svm with modified lesk approach for word sense disambiguatio...
PDF
Ny3424442448
PDF
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH
PDF
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH
PDF
An approach to speed up the word sense disambiguation procedure through sense...
PDF
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
PDF
April 2020 most read artilce in contro theory & computer controlling
International Journal of Engineering and Science Invention (IJESI)
A Survey on Word Sense Disambiguation
Doc format.
Word sense disambiguation a survey
A supervised word sense disambiguation method using ontology and context know...
Wsd final paper
Malayalam Word Sense Disambiguation using Machine Learning Approach
Advances in word sense disambiguation
Automatic classification of bengali sentences based on sense definitions pres...
AMBIGUITY-AWARE DOCUMENT SIMILARITY
IJNLC 2013 - Ambiguity-Aware Document Similarity
MODIFIED PAGE RANK ALGORITHM TO SOLVE AMBIGUITY OF POLYSEMOUS WORDS
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH
A performance of svm with modified lesk approach for word sense disambiguatio...
Ny3424442448
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH
An approach to speed up the word sense disambiguation procedure through sense...
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
April 2020 most read artilce in contro theory & computer controlling

More from acijjournal (20)

PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PDF
July 2025-Top 10 Read articles ACIJ Advanced Computing: An International Jour...
PDF
MODEL AND ALGORITHM FOR INCREASING THE EFFICIENCY OF REMOTE SERVICE SYSTEMS S...
PDF
15th International Conference on Computer Science, Engineering and Applicatio...
PDF
4th International Conference on Computer Science and Information Technology (...
PDF
APPLICATION AND ANALYSIS OF ENSEMBLE ALGORITHMS IN SOLVING REGRESSION PROBLEMS
PDF
4th International Conference on Computer Science and Information Technology (...
PDF
Application and Analysis of Ensemble Algorithms in Solving Regression Problems
PDF
17th International Conference on Networks & Communications (NeTCoM 2025)
PDF
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
PDF
Advanced Computing: An International Journal (ACIJ)
PDF
6 th International Conference on Data Mining and Software Engineering (DMSE 2...
PDF
ARTICLE :OVERVIEW OF STRUCTURE FROM MOTION
PDF
14th International Conference on Advanced Information Technologies and Applic...
PDF
2nd International Conference on Information Technology Convergence Services &...
PDF
Advanced Computing: An International Journal ( ACIJ )
PDF
3rd International Conference on Computer Science, Engineering and Artificia...
PDF
6th International Conference on Big Data and Machine Learning (BDML 2025)
PDF
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
PDF
4th International Conference on Computing and Information Technology Trends (...
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
July 2025-Top 10 Read articles ACIJ Advanced Computing: An International Jour...
MODEL AND ALGORITHM FOR INCREASING THE EFFICIENCY OF REMOTE SERVICE SYSTEMS S...
15th International Conference on Computer Science, Engineering and Applicatio...
4th International Conference on Computer Science and Information Technology (...
APPLICATION AND ANALYSIS OF ENSEMBLE ALGORITHMS IN SOLVING REGRESSION PROBLEMS
4th International Conference on Computer Science and Information Technology (...
Application and Analysis of Ensemble Algorithms in Solving Regression Problems
17th International Conference on Networks & Communications (NeTCoM 2025)
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
Advanced Computing: An International Journal (ACIJ)
6 th International Conference on Data Mining and Software Engineering (DMSE 2...
ARTICLE :OVERVIEW OF STRUCTURE FROM MOTION
14th International Conference on Advanced Information Technologies and Applic...
2nd International Conference on Information Technology Convergence Services &...
Advanced Computing: An International Journal ( ACIJ )
3rd International Conference on Computer Science, Engineering and Artificia...
6th International Conference on Big Data and Machine Learning (BDML 2025)
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
4th International Conference on Computing and Information Technology Trends (...

Recently uploaded (20)

PPT
Module 1.ppt Iot fundamentals and Architecture
PDF
Enhancing plagiarism detection using data pre-processing and machine learning...
PPTX
AI IN MARKETING- PRESENTED BY ANWAR KABIR 1st June 2025.pptx
PDF
Credit Without Borders: AI and Financial Inclusion in Bangladesh
PDF
Comparative analysis of machine learning models for fake news detection in so...
PDF
Zenith AI: Advanced Artificial Intelligence
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PPTX
Modernising the Digital Integration Hub
PDF
STKI Israel Market Study 2025 version august
PDF
Architecture types and enterprise applications.pdf
PDF
Consumable AI The What, Why & How for Small Teams.pdf
PDF
A proposed approach for plagiarism detection in Myanmar Unicode text
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
PDF
The influence of sentiment analysis in enhancing early warning system model f...
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PPT
Geologic Time for studying geology for geologist
PPT
Galois Field Theory of Risk: A Perspective, Protocol, and Mathematical Backgr...
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PDF
A review of recent deep learning applications in wood surface defect identifi...
PPTX
Microsoft Excel 365/2024 Beginner's training
Module 1.ppt Iot fundamentals and Architecture
Enhancing plagiarism detection using data pre-processing and machine learning...
AI IN MARKETING- PRESENTED BY ANWAR KABIR 1st June 2025.pptx
Credit Without Borders: AI and Financial Inclusion in Bangladesh
Comparative analysis of machine learning models for fake news detection in so...
Zenith AI: Advanced Artificial Intelligence
NewMind AI Weekly Chronicles – August ’25 Week III
Modernising the Digital Integration Hub
STKI Israel Market Study 2025 version august
Architecture types and enterprise applications.pdf
Consumable AI The What, Why & How for Small Teams.pdf
A proposed approach for plagiarism detection in Myanmar Unicode text
Final SEM Unit 1 for mit wpu at pune .pptx
The influence of sentiment analysis in enhancing early warning system model f...
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
Geologic Time for studying geology for geologist
Galois Field Theory of Risk: A Perspective, Protocol, and Mathematical Backgr...
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
A review of recent deep learning applications in wood surface defect identifi...
Microsoft Excel 365/2024 Beginner's training

A decision tree based word sense disambiguation system in manipuri language

  • 1. Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014 DOI:10.5121/acij.2014.5403 17 A DECISION TREE BASED WORD SENSE DISAMBIGUATION SYSTEM IN MANIPURI LANGUAGE Richard Laishram Singh1 , Krishnendu Ghosh1 , Kishorjit Nongmeikapam2 and Sivaji Bandyopadhyay3 1 School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, India 2 Department of Computer Science and Engineering, Manipur Institute of Technology Manipur, India 3 Dept of Computer Science and Engineering, Jadavpur University, Kolkata, India ABSTRACT This paper manifests a primary attempt on building a word sense disambiguation system in Manipuri language. The paper discusses related attempts made in the Manipuri language followed by the proposed plan. A database, consisting of 650 sentences, is collected in Manipuri language in the course of the study. Conventional positional and context based features are suggested to capture the sense of the words, which have ambiguous and multiple senses. The proposed work is expected to predict the senses of the polysemous words with high accuracy with the help of the suitable knowledge acquisition techniques. The system produces an accuracy of 71.75 %. KEYWORDS Word sense disambiguation(WSD); Classification and regression tree (CART); polysemous word; Manipuri. 1. INTRODUCTION Word sense disambiguation (WSD) is the technique to disambiguate between the senses of a word [1]. It has been recognized as one of the major problems in the field of natural language processing [2]. Word sense disambiguation is not an isolated system by itself, but can be fitted with other important tasks such as, information retrieval, machine translation, speech processing, parts-of-speech tagging and text processing [3, 4]. It has been noted that, the sense of a word is determined based on its syntactic, positional, contextual and other relevant factors. In different scenario, the sense or the meaning of a single word can be different. Such words are known as polysemous words. The current study suggests development of an automatic WSD system [5] with the help of a knowledge acquisition technique. The goal of the present work is to develop a system which can disambiguate between the different senses of a polysemous word in Manipuri language. In European languages, efficient and automatic WSD systems are present. In Indian languages, such systems are mostly rule-based due to lack of standardized database and presence of the proper knowledge acquisition tools. So far, negligible amount of work has been reported in Manipuri language. That motivates the present attempt to collect a suitable database consisting of every possible contexts and senses, used in day-to-day life.
  • 2. Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014 18 The current study reports mainly the development issues of a WSD system in Manipuri language. Manipuri is basically a Tibeto-Burman language, spoken mainly in the valley of Manipur, a North-Eastern state of India. Hence, the syntactic and semantic structures of the language are different from other Indian languages. The task of disambiguating the senses in Manipuri language poses a challenge due to its agglutinative, reduplicative, compounding and tonal nature [6]. The language uses the traditional Meitei-Mayek script for general use. But, that script is not used in the current study due of presence of limited words and structures. The present work uses Bengali script for developing the database. The lack of a standard font made the task even more challenging. Besides, limited amount of works are reported in Manipuri language in the field of natural language processing such as, POS tagging [7], Transliteration [8] and negligible amount in WSD. The paper is organized as follows: related literatures are discussed in section 2. The details of the corpus are mentioned in section 3 The proposed architecture of the word sense disambiguation module is discussed in the section 4 followed by the conclusion and direction towards future work in section 5. Finally a few important references are added. 2. LITERATURE SURVEY Several works on WSD are reported in English and other European languages. On the contrary, limited works have been noted for Indian languages. The approaches are classified in five groups: (i) selection restriction based disambiguation, (ii) machine learning based approach, (iii) supervised learning approach, (iv) bootstrapping approach and (v) dictionary based disambiguation. 2.1 Selection restriction based disambiguation Selection restriction and type hierarchies are the primary knowledge sources used to disambiguate the senses of the words in most of the integrated approaches. In an integrated rule based approach for semantic analysis [9], selection restrictions block the formation of component meaning representations containing the selection restriction violations [1]. For Indian languages, where the semantic and syntactic structures are not analyzed properly and the required knowledge sources are missing, rule based processes generally are not suggested. 2.2 Machine learning based approach The template is used to format your paper and style the text. All margins, column widths, line spaces, and text fonts are prescribed; please do not alter them. You may note peculiarities. For example, the head margin in this template measures proportionately more than is customary. This measurement and others are deliberate, using specifications that anticipate your paper as one part of the entire proceedings, and not as an independent document. Please do not revise any of the current designations. 2.3 Supervised learning approach Supervised learning is a specific category of the machine learning approaches to train the classifier from supervised (labelled) training data [1]. The classifier is fed with the training data consisting of the exemplars along with the output labels. The classifier predicts the correct output value for any valid test input object.
  • 3. Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014 19 2.4 Bootstrapping approach Bootstrapping approach is an incrementive machine learning based approach, widely used for developing word sense disambiguation systems. Bootstrapping approach eliminates need for a large training set by relying on a relatively less number of instances called seeds. Seeds are used for training and testing. The generated pair of test patterns and the predicted output is later added with the seeds to populate the training data. This process is very useful, if the base database is comparatively small in size. 2.5 Bootstrapping approach Dictionary based disambiguation approaches use a dictionary which provides both the means of constructing a sense tagger and the target sense [1, 10]. In order to perform a large scale disambiguation, machine readable dictionaries (MRD) are used to automate the task. This process is useful for the languages having a digitalized machine readable dictionary. Due to the lack of standard database and related knowledge acquisition tools, such approaches can hardly be used for the focus Manipuri or other Indian languages. 3. DATABASE COLLECTION The database used in the current study contains a total of 672 Manipuri sentences. These sentences are collected from a local newspaper "The Sangai Express". The sentences contain a total of 13,167 words and around 2,000 polysemous words. Polysemous words are the focus of the current study as they contain more than one sense based on contexts and other factors. The sentences are chosen from the news domains to accommodate day-to-day words and senses mostly. The entire database, due to lack of digital resources and standard font are typed manually [11]. The database is thoroughly checked for inconsistencies present in spelling, spacing and, punctuation. The polysemous words are then tagged with the sense manually. 4. PROPOSED ARCHITECTURE The current study attempts to find a solution to disambiguate between the possible senses of Manipuri words. A base database is collected with sentences consisting of day-to-day senses. The database is then preprocessed and suggested features are collected out of it. Then, a classifier is developed with the knowledge obtained from the features using a learning algorithm. The classifier finally predicts the sense of the test sentences. The word sense disambiguation system works in two phases: (i) training phase and (ii) testing phase. In the training phase, the training data is first preprocessed and features are generated. The features are used then to train the classifier based on any learning algorithm. On the contrary, in the testing phase, the test data is preprocessed and features are generated. The features of the test data are fed to the classifier and the predicted output is compared for performance evaluation of the system. Hence, the proposed architecture of the word sense disambiguation module for Manipuri language contains five building blocks: (i) preprocessing, (ii) feature selection and generation and (iii) training, (iv) testing and (v) performance evaluation. A. Preprocessing Preprocessing is the module where the raw data is processed so that the features can be generated from the training or test data efficiently. Data is converted to corresponding ASCII format after
  • 4. Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014 20 the inconsistencies (like spelling, punctuations and spacing mistakes) are removed. When the data is converted to ASCII format, the selected features are generated semi-automatically. B. Feature Selection and Generation In this phase, the ASCII format generated database is used to build the feature. The current study, as it is in the preliminary attempt, uses a set of very common, widely popular and easily extractable features. A total of 6 features are taken to build feature: (i) the focus word for which the sense is to be derived, (ii) the normalized position of the word in the sentence, (iii) the previous word, (iv) the previous-to-previous word, (v) the next word, (vi) the next to next word. A 5-gram window is formed using the pair of the focus word and its context words which forms the context information. A focus word, based on the context may have different senses. Hence, in order to disambiguate the sense of the focused word, the contextual information is very much necessary [13], and helps in predicting the correct one. In the current study positional feature is suggested because of the lack of other relevant morphological features. As the syntactic and semantic structures of a sentence remain mostly similar for a particular language, this feature contains probable morphological information. To generate the final input feature vector, from the database mentioned above mentioned six features are collected automatically. C. Training In the present work, a supervised learning algorithm is proposed based on decision tree algorithm. In supervised approaches, the classifier is developed with the help of the input feature vector and the output labels of the data. Hence, the final feature vector is developed using the six features mentioned in section B and the output sense of the focus word. The sense of the focus word is derived manually and finally the seven entries are fed to the classifier. The classifier is trained with the decision tree algorithm. The decision tree is a simple and linear decision making approach. In the current work, classification and regression tree (CART) based algorithm is suggested to train the classifier. Binary decision tree is developed automatically using the seven entries of the focus word present in training data. CART uses all the instances of the training data and asks binary questions on the features. CART algorithm selects the most predictive feature from the feature set and the best possible question to achieve accurate classification of the training. Based on the feature and its distribution of values, the training data is segmented into two parts. This segmentation is carried out based on each of the features and finally at the leaves, a output class is achieved. CART has been reported as one of the suitable learning algorithms for developing supervised learning models for Indian languages like Manipuri whose characteristics have not studied in detail. An example of a CART is given in Fig 1.
  • 5. Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014 21 Fig 1: Flow Chart of CART based word sense disambiguation D. Testing During the testing, CART offers prediction by comparing the features for the test case with the trained decision tree. While predicting the sense for a test word, the corresponding features are generated and compared with the trained decision tree. Starting from the root node asking a question on individual features, the features of the test word will lead to a leaf which offers an approximated output. The features are already generated in the feature generation phase. The trained CART is fed with features are test data and corresponding approximations are noted. The predictions are later compared with the correct sense tags to perform evaluation of the current system. E. Performance Evaluation The accuracy of the proposed word sense disambiguation system is evaluated using an objective analysis. The analysis is carried out by determining the correct sense prediction percentage for the test words. Being trained on 1,600 words and tested on 400 words using 7 features, the word sense disambiguation system predicts the senses with 71.75% accuracy. The performance of the proposed model is discussed in TABLE 1. TABLE I: Objective Evaluation of CART based word sense disambiguation model 5. CONCLUSION AND FUTURE WORK This paper proposed a preliminary attempt to predict the senses of the words for Manipuri language. A set of positional and contextual features are suggested for developing the word sense disambiguation system. Due to unavailability of morphological and other relevant linguistic Predicted Sense Correctly Predicted Sense Accuracy 400 287 71.75%
  • 6. Advanced Computing: An International Journal (ACIJ), Vol.5, No.4, July 2014 22 knowledge, the accuracy of the model is lower than the models built for languages like English. Moreover, the size of training data or the features used are not sufficient to achieve the classification quality of human transcribers. A relevant and appropriate feature set can achieve higher accuracy. Further improvements are expected by: 1. Joining a rule-based model with the CART based data-driven model to improve accuracy of the word sense disambiguation system. 2. Exploring other well-known classifiers such as ANN, SVM or other probabilistic methods like HMM and N-gram models. ACKNOWLEDGEMENTS The work presented in this paper is performed at KIIT University as a project for the fullfilment of the degree of Masters of Technology. Special thanks to Mr. Kishorjit N. Singh for his help in data collection. REFERENCES [1] D.Jurafsky, J.H.Martin, “Speech and Language Processing,”, Pearson Publishers, December, pp. 658- 672 [2] R.Mihalcea, D. Moldovan, “Word Sense Disambiguation based on Semantic Density”, in proceedings of Seventh International conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010, pp. 1492-1496. [3] M.Sinha, M.K.Reddy, P.Bhattacharyya, P.Pandey and L.Kashyap, “Word Sense Disambiguation,” in proceedings of International Journal of Computer Applications (IJCA),2010, vol. 5.9, pp. 25-32. [4] R.Mihalcea and D. Moldovan, “A Method for Disambiguating Word Senses in a Large Corpus,” in proceedings of COLING/ACL Workshop on Usage of WordNet in Natural Language Processing, Computers and the Humanities, 1992, volume 26, no. 5-6,pp. 415-439. [5] C.Leacock, G.Towell and E.Voorhees, “Corpus-Based Statistical Sense Resolution,” ARPA Workshop on Human Language Technology, 1993. [6] T.D.Singh and S.Bandyopadhyay, “Word Word Class and Sentence Type Identification in Manipuri Morphological Analyzer,” in proceedings of MSPIL, 2006. [7 ]K.Nongmeikapam, L.Nonglenjaoba, Y.Nirmal and S.Bandhyopadhyay, “Improvement of CRF based Manipuri POS tagger by using Reduplicated MWE (RMWE),” arXiv preprint arXiv: 1111.2399, 2011. [8] K.Nongmeikapam and S. Bandhyopadhyay, “A Transliteration of CRF Based Manipuri POS Tagging,” in proceedings of Procedia Technology, Elsevier, 2012, pp. 582-589. [9] M S. Olsen, “WordNet Word sense Disambiguation using an Automatically Generated Ontology,” Class of 2003 Senior Conference on Natural Language Processing, 2003. [10] J.Veronis and M. Nancy, “Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionarie,” in Proceedings of 13th conference on Computational linguistics, 1990, volume 2, pp. 389-394. [11] M.M. Khapra, S. Shah, P. Kedia and P. Bhattacharyya, “Projecting Parameters for Multilingual Word Sense Disambiguationin Proceedings of 2009 Conference on Empirical Methods in Natural Language Processing, 2009,volume 1, pp. 459-467. [12] K.Ghosh, R. V. Reddy, N. P. Narendra, S. Maity, S. Koolagudi and K. S. Rao, “Grapheme-to- phoneme Conversion in Bengali for Festival based TTS Framework”, in Proceedings of International Conference of Natural Language Processing, 2010, pp. 294. [13] A.Eneko and G.Rigau, “Word Sense Disambiguation using Conceptual Density”, in Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, 1996.