International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
DOI: 10.5121/ijnlc.2014.3312
CHUNKING IN MANIPURI USING CRF
Kishorjit Nongmeikapam¹, Chiranjiv Chingangbam¹, Nepoleon Keisham¹, Biakchungnunga Varte¹, Sivaji Bandopadhyay²
¹Department of Computer Science & Engineering, Manipur Institute of Technology, Manipur University, Imphal, India
²Department of Computer Science & Engineering, Jadavpur University, West Bengal, India
ABSTRACT
This paper deals with the chunking of the Manipuri language, which is highly agglutinative in nature. The Manipuri text is first cleaned to gold-standard quality and then processed for Part of Speech (POS) tagging using a Conditional Random Field (CRF). The tagged output file serves as the input to the CRF based chunking system, and the final output is a fully chunk-tagged Manipuri text. The system shows a recall of 71.30%, a precision of 77.36% and an F-measure of 74.21%.
KEYWORDS
CRF; POS; Chunk; Manipuri
1. INTRODUCTION
The Manipuri language originates in the north-eastern part of India; it is widely spoken in the state of Manipur and to some extent in Myanmar and Bangladesh. Manipuri belongs to the class of highly agglutinative languages. Conditional Random Fields (CRFs) serve as a powerful model for predicting structured labels over such text.
Chunking is the process of identifying and labeling simple phrases (such as noun phrases or verb phrases) in tagged output; the words that make up a given phrase form a chunk in this language. A POS tagged sequence is therefore a natural base input for CRF based chunking.
We produce a fully chunked Manipuri file as output. The procedure we follow is that the input file is passed to a CRF based POS tagger, and the output of the tagger then serves as the input for the CRF based chunker, which generates the chunked output file.
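The two-stage flow can be sketched as follows. Both stages are stubbed with toy lookups purely to show the data flow; the words and tags below are illustrative, not taken from the paper's corpus or tag set:

```python
# Two-stage pipeline sketch: a POS-tagging stage feeds a chunking stage.
# The real system trains a CRF at each stage; here both are toy stand-ins.
def pos_tag(tokens):
    """Stand-in for the CRF POS tagger (hypothetical tags)."""
    toy = {"eina": "PR", "chak": "NN", "chare": "VFC"}
    return [(t, toy.get(t, "NN")) for t in tokens]

def chunk(tagged):
    """Stand-in for the CRF chunker: consumes (word, POS) pairs."""
    out = []
    for tok, pos in tagged:
        tag = "B-VP" if pos.startswith("V") else "B-NP"
        out.append((tok, pos, tag))
    return out

result = chunk(pos_tag(["eina", "chak", "chare"]))
print(result)
```

The key point is the column layout: the chunker sees the word plus its POS tag as input features, exactly as the second CRF layer does in the paper.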
The paper is arranged as follows: related works are listed in Section 2, Section 3 describes the concept of the Conditional Random Field (CRF), Section 4 presents the system design, Section 5 discusses the experiment and evaluation, and Section 6 draws the conclusion.
2. RELATED WORKS
Until now, no work in the area of CRF based chunking has been reported for the Manipuri language. Most of the previous work on this task for other languages uses one of two machine-learning approaches to sequence labeling: the HMM, as in [1], or treating sequence labeling as a series of classification problems, one for each label in the sequence.
Apart from these two approaches, CRF based chunking combines the best of the generative and classification models. It resembles classification models in that it can accommodate many statistically correlated input features, and it resembles generative models in that it can trade off decisions at different sequence positions, thereby obtaining a globally optimal labeling. It is shown in [2] that CRFs outperform related classification models. Parsing by chunks is discussed in [3], dynamic programming for parsing and estimation of stochastic unification-based grammars is covered in [4], and other related works are found in [5]-[7].
In the field of text chunking, [1] proposed a Conditional Random Field based approach. Work on chunking has applied both rule based and probabilistic or statistical methods.
3. CONCEPT OF CONDITIONAL RANDOM FIELD
The Conditional Random Field [8] was developed to calculate the conditional probabilities of values on designated output nodes of an undirected graphical model, given values on other designated input nodes. A CRF encodes a conditional probability distribution with a given set of features. It is a supervised approach: the system learns from labeled training data and can then be used to tag other texts. The conditional probability of a state sequence Y = (y1, y2, ..., yT) given an observation sequence X = (x1, x2, ..., xT) is calculated as:
P(Y|X) = (1 / Z_X) exp( Σ_{t=1..T} Σ_k λ_k f_k(y_{t-1}, y_t, X, t) )   ---(1)
where f_k(y_{t-1}, y_t, X, t) is a feature function whose weight λ_k is learned via training. The values of the feature functions may range from -∞ to +∞, but typically they are binary. Z_X is the normalization factor:
Z_X = Σ_Y exp( Σ_{t=1..T} Σ_k λ_k f_k(y_{t-1}, y_t, X, t) )   ---(2)
which is calculated to make the probabilities of all state sequences sum to 1. As in the Hidden Markov Model (HMM), it can be obtained efficiently by dynamic programming. Since the CRF defines the conditional probability P(Y|X), the appropriate objective for parameter learning is to maximize the conditional likelihood of the state sequences in the training data:
Σ_{i=1..N} log P(y_i | x_i)   ---(3)

where {(x_i, y_i)} is the labeled training data.
A Gaussian prior on the λ's is used to regularize the training (i.e., smoothing). If λ_k ~ N(0, ρ²), the objective function becomes:
Σ_{i=1..N} log P(y_i | x_i) − Σ_k λ_k² / (2ρ²)   ---(4)
The objective function is concave, so the λ’s have a unique set of optimal values.
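Equations (1) and (2) can be made concrete with a toy linear-chain CRF, computing Z_X by brute-force enumeration over all label sequences. Every feature function, weight, and word below is invented for illustration and is not from the paper:

```python
# Toy linear-chain CRF illustrating Eqs. (1)-(2) by exhaustive enumeration.
from itertools import product
from math import exp

LABELS = ["B", "I", "O"]

def score(y_prev, y, x, t):
    """Sum of weighted feature functions: sum_k lambda_k * f_k(y_{t-1}, y_t, X, t)."""
    s = 0.0
    if x[t].endswith("gi") and y == "I":   # hypothetical suffix feature
        s += 1.2
    if y_prev == "B" and y == "I":         # transition feature B -> I
        s += 0.8
    if y_prev == "O" and y == "I":         # discourage I directly after O
        s -= 2.0
    return s

def unnorm(x, ys):
    """Unnormalized score exp(sum_t sum_k lambda_k f_k) for one label sequence."""
    return exp(sum(score(ys[t - 1] if t > 0 else "<s>", ys[t], x, t)
                   for t in range(len(x))))

def p_y_given_x(x, ys):
    """Eq. (1): normalize by Z_X, Eq. (2), computed by brute force."""
    z = sum(unnorm(x, cand) for cand in product(LABELS, repeat=len(x)))
    return unnorm(x, ys) / z

x = ["word1", "word2gi", "word3"]          # dummy observation sequence
total = sum(p_y_given_x(x, ys) for ys in product(LABELS, repeat=len(x)))
print(round(total, 6))                     # all state sequences sum to 1
```

In practice the enumeration over 3^T sequences is replaced by the dynamic programming mentioned above; the brute force version only verifies that Z_X makes the distribution sum to 1.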
4. SYSTEM DESIGN
The system applies the CRF in two layers. The first layer performs POS tagging of the Manipuri text file using the features mentioned in [9]. In the second layer, the output file of the CRF based POS tagging is used as the input file of the CRF based chunking. Fig. 1 shows the system block diagram.
The chunk tags follow the I-O-B (IOB) scheme, as follows:

TABLE I. IOB TAGGING

B-X  Beginning of a chunk of type X
I-X  Intermediate (non-beginning) word of a chunk of type X
O    Word outside of any chunk
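The IOB scheme above can be decoded back into chunks with a small helper; this is an illustrative sketch, not part of the authors' system:

```python
# Group IOB-tagged tokens back into (chunk_type, words) spans.
def iob_to_chunks(tokens, tags):
    """Return chunk spans from parallel token/tag lists using IOB semantics."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # B-X opens a new chunk
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)             # I-X continues the open chunk
        else:                                  # "O", or a stray I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

spans = iob_to_chunks(["w1", "w2", "w3", "w4"], ["B-NP", "I-NP", "O", "B-VP"])
print(spans)
```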
The processing and running of the CRF is shown in Fig. 2.

Figure 1. System block diagram: text file → CRF based POS tagger → CRF based Manipuri chunker → chunked Manipuri file

In the first run the input is a training file, which yields a model file as output; in the second run the input is a testing file. The output file of the CRF is a labeled file.
Figure 2. CRF based POS tagging: documents collection → pre-processing → feature extraction → CRF model (data training), then labeling of the data test → evaluation results

The working of the CRF is mainly based on feature selection. The features used for POS tagging are as follows:
F = { W_{i-m}, ..., W_{i-1}, W_i, W_{i+1}, ..., W_{i+n}, SW_{i-m}, ..., SW_{i-1}, SW_i, SW_{i+1}, ..., SW_{i+n}, number of acceptable standard suffixes, number of acceptable standard prefixes, acceptable suffixes present in the word, acceptable prefixes present in the word, word length, word frequency, digit feature, symbol feature, RMWE }

The details of the set of features that have been applied for POS tagging in Manipuri are as follows:
1. Surrounding words as feature: The preceding word(s) and the successive word(s) are important in POS tagging because they play a major role in determining the POS of the present word.
2. Surrounding stemmed words as feature: The stemming algorithm mentioned in [10] is used. The stemmed forms of the preceding and following words of a particular word are used as features, because the surrounding words influence the POS of the present word.
3. Number of acceptable standard suffixes as feature: As mentioned in [10], Manipuri being an agglutinative language, suffixes play an important role in determining the POS of a word. For every word the suffixes are identified during stemming, and their number is used as a feature.
4. Number of acceptable standard prefixes as feature: Prefixes also play an important role in Manipuri. They are identified during stemming, and the prefixes are used as a feature.
5. Acceptable suffixes present as feature: The 61 standard suffixes of Manipuri that have been identified are used as one feature. The maximum number of appended suffixes observed is ten, so for every word ten space-separated columns are created, one for each suffix present in the word. A "0" is placed in a column when the word has no acceptable suffix for it.
6. Acceptable prefixes present as feature: 11 prefixes have been manually identified in Manipuri, and this list is used as one feature. For every word, if a prefix is present a column is created mentioning the prefix; otherwise the "0" notation is used.
7. Length of the word: The word-length feature is set to 1 if the word is longer than 3 characters; otherwise it is set to 0. Very short words are generally pronouns and rarely proper nouns.
8. Word frequency: A frequency threshold is set for words in the training corpus: words occurring fewer than 100 times get the value 0, and words occurring 100 times or more get the value 1. This is used as a feature since determiners, conjunctions and pronouns occur abundantly.
9. Digit feature: Quantity measurements, dates and monetary values are generally digits, so the digit feature is important. A binary '1' is used if the word contains a digit, else '0'.
10. Symbol feature: Symbols like $, % etc. are meaningful in text, so this feature is set to 1 if a symbol is found in the token, otherwise 0. It helps to recognize symbol and quantifier number tags.
11. Reduplicated Multiword Expression (RMWE): RMWEs are also considered as a feature, since Manipuri is rich in them. The RMWE work in [11] is used.
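Features 3-10 above amount to a per-token feature extractor, which can be sketched as follows. The suffix and prefix lists here are tiny hypothetical stand-ins, not the real 61-suffix and 11-prefix lists used by the authors:

```python
# Sketch of per-token feature extraction, roughly following features 3-10.
import re

SUFFIXES = ["gi", "da", "na"]     # hypothetical sample suffixes (real list has 61)
PREFIXES = ["a", "khu"]           # hypothetical sample prefixes (real list has 11)

def features(word, freq):
    """Build a feature dict for one token given its corpus frequency."""
    suff = [s for s in SUFFIXES if word.endswith(s)]
    pref = [p for p in PREFIXES if word.startswith(p)]
    return {
        "num_suffixes": len(suff),                           # feature 3
        "num_prefixes": len(pref),                           # feature 4
        "length_gt_3": 1 if len(word) > 3 else 0,            # feature 7
        "freq_ge_100": 1 if freq >= 100 else 0,              # feature 8
        "has_digit": 1 if re.search(r"\d", word) else 0,     # feature 9
        "has_symbol": 1 if re.search(r"[$%]", word) else 0,  # feature 10
    }

feats = features("khudamgi", freq=12)
print(feats)
```

In the actual system these values are written out as space-separated columns per token, the format CRF++ expects.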
5. EXPERIMENT AND EVALUATION

The text document is cleaned for processing: errors and grammatical mistakes are minutely checked by an expert. For the POS tagging the expert also marks each word with a POS from a tag set. The POS-marked texts are used for both training and testing.

Once the text documents are tagged with POS, the same text with POS and the previous features is used to run the CRF based chunking; in other words, the POS tags serve as additional features for the chunking. The C++ based CRF++ 0.53 package¹ is used in this work and is readily available as open source for segmenting or labeling sequential data.
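A CRF++ run of this kind is driven by a feature template file. The template below is a sketch, since the paper does not give the actual template used; it assumes column 0 holds the word, column 1 the POS tag, and the last column the chunk tag:

```
# CRF++ template sketch: unigram features over word and POS context
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,1]
U05:%x[0,1]
U06:%x[1,1]
# Bigram feature over adjacent output tags
B
```

Training and testing then use the package's standard tools, e.g. `crf_learn template train.data model` followed by `crf_test -m model test.data`.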
In total, a corpus of 30,000 words is used to train and test the system. The corpus is considered gold standard since an expert manually identified the POS and the chunk words. Fig. 3 shows a sample of the POS and chunk tags marked by the expert.
……
[Manipuri word] NN B-X
[Manipuri word] JJ B-X
[Manipuri word] NC I-X
[Manipuri word] QT I-X
[Manipuri word] VFC B-X
| SYM O
……

Figure 3. Sample of words with POS and IOB chunk tags (the Manipuri script words are not reproducible here)
Of the 30,000 words, 20,000 are used for training and the remaining 10,000 for testing.

Evaluation is done with the parameters of recall, precision and F-score, defined as follows:
Recall, R = No. of correct answers given by the system / No. of correct answers in the text

Precision, P = No. of correct answers given by the system / No. of answers given by the system
¹ http://crfpp.sourceforge.net/
F-score, F = (β² + 1)PR / (β²P + R)

where β is one, so precision and recall are given equal weight.
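Plugging the reported precision and recall into this formula (with β = 1) reproduces the reported F-measure:

```python
# F-score per the formula above, checked against the reported results.
def f_score(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

precision, recall = 77.36, 71.30
print(round(f_score(precision, recall), 2))   # 74.21
```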
Different combinations of the features are tried for chunking the Manipuri text. Among the combinations, the best feature set is found to be:
F = { W_{i-2}, W_{i-1}, W_i, W_{i+1}, SW_{i-1}, SW_i, SW_{i+1}, number of acceptable standard suffixes, number of acceptable standard prefixes, acceptable suffixes present in the word, acceptable prefixes present in the word, word length, word frequency, digit feature, symbol feature, reduplicated MWE, POS }
Table II shows the recall, precision and F-measure of the system.

TABLE II. BEST RESULT

Model  Recall  Precision  F-Score
CRF    71.30   77.36      74.21

6. CONCLUSIONS
So far, no chunking work on Manipuri has been reported, and this work can be a starting point for the future. Other algorithms can also be explored to improve the score. The main handicap of this language is its highly agglutinative nature. The system shows a recall of 71.30%, a precision of 77.36% and an F-measure of 74.21%, which leaves considerable room for improvement.
REFERENCES
[1] Fei Sha and Fernando Pereira,“Shallow Parsing with Conditional Random Fields”.In the Proceedings
of HLT-NAACL 2003.
[2] John Lafferty, Andrew McCallum and Fernando Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.
[3] S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors, Principle-based
Parsing. Kluwer Academic Publishers, 1991.
[4] S. Geman and M. Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proc. 40th ACL, 2002.
[5] A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In C.
Cardie and R. Weischedel, editors, EMNLP-2. ACL, 1997.
[6] E. F. T. K. Sang. Memory-based shallow parsing. Journal of Machine Learning Research, 2:559-594, 2002.
[7] T. Zhang, F. Damerau, and D. Johnson. Text chunking based on a generalization of Winnow. Journal of Machine Learning Research, 2:615-637, 2002.
[8] Lafferty, J., McCallum, A., Pereira, F. Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data, In the Procceedings of the 18th ICML01, Williamstown,
MA, USA., 2001, p. 282-289.
[9] Kishorjit, N. and Sivaji, B., "A Transliteration of CRF Based Manipuri POS Tagging", In the Proceedings of the 2nd International Conference on Communication, Computing & Security (ICCCS-2012), Elsevier Ltd, 2012.
[10] Kishorjit, N., Bishworjit, S., Romina, M., Mayekleima Chanu, Ng. & Sivaji, B., (2011) A Light Weight Manipuri Stemmer, In the Proceedings of the National Conference on Indian Language Computing (NCILC), Cochin, India.
[11] Kishorjit Nongmeikapam, Nonglenjaoba L., Nirmal Y. & Sivaji Bandhyopadhyay, Reduplicated MWE (RMWE) Helps in Improving the CRF Based Manipuri POS Tagger, International Journal of Information Technology Convergence and Services (IJITCS) Vol.2, No.1, DOI: 10.5121/ijitcs.2012.2106, 2012, p.45-59.
Authors
Kishorjit Nongmeikapam is working as an Asst. Professor at the Department of Computer Science and Engineering, MIT, Manipur University, India. He completed his BE from PSG College of Tech., Coimbatore, and his ME from Jadavpur University, Kolkata, India. He is presently doing research in the area of Multiword Expressions and their applications. He has so far published 30 papers and is presently handling a Transliteration project funded by DST, Govt. of Manipur, India. He is the author of the book "See the C Programming Language".
Chiranjiv Chingangbam is presently a student of Manipur Institute of Technology. He is pursuing his B.E. in the Dept. of Computer Science and Engineering. His area of interest is NLP.

Nepoleon Keisham is presently a student of Manipur Institute of Technology. He is pursuing his B.E. in the Dept. of Computer Science and Engineering. His area of interest is NLP.

Biakchungnunga Varte is presently a student of Manipur Institute of Technology. He is pursuing his B.E. in the Dept. of Computer Science and Engineering. His area of interest is NLP.
Sivaji Bandyopadhyay is working as a Professor since 2001 in the Computer Science
and Engineering Department at Jadavpur University, Kolkata, India. His research interests
include machine translation, sentiment analysis, textual entailment, question answering
systems and information retrieval among others. He is currently supervising six national
and international level projects in various areas of language technology. He has published
a large number of journal and conference publications.

More Related Content

PDF
Verb based manipuri sentiment analysis
PDF
Quality estimation of machine translation outputs through stemming
PDF
Compression-Based Parts-of-Speech Tagger for The Arabic Language
PDF
Named Entity Recognition using Hidden Markov Model (HMM)
PDF
Survey on Indian CLIR and MT systems in Marathi Language
PDF
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
PDF
A Novel Approach for Rule Based Translation of English to Marathi
PDF
Extractive Summarization with Very Deep Pretrained Language Model
Verb based manipuri sentiment analysis
Quality estimation of machine translation outputs through stemming
Compression-Based Parts-of-Speech Tagger for The Arabic Language
Named Entity Recognition using Hidden Markov Model (HMM)
Survey on Indian CLIR and MT systems in Marathi Language
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
A Novel Approach for Rule Based Translation of English to Marathi
Extractive Summarization with Very Deep Pretrained Language Model

What's hot (17)

PDF
BERT: Bidirectional Encoder Representations from Transformers
PPTX
[Paper review] BERT
PDF
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
PDF
An expert system for automatic reading of a text written in standard arabic
PPTX
1909 BERT: why-and-how (CODE SEMINAR)
PPTX
BERT introduction
PDF
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
PPTX
NLP State of the Art | BERT
PDF
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
PDF
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
PDF
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PPTX
PDF
Improving the role of language model in statistical machine translation (Indo...
PDF
combination
PDF
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
DOC
P-6
PDF
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
BERT: Bidirectional Encoder Representations from Transformers
[Paper review] BERT
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
An expert system for automatic reading of a text written in standard arabic
1909 BERT: why-and-how (CODE SEMINAR)
BERT introduction
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
NLP State of the Art | BERT
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
Improving the role of language model in statistical machine translation (Indo...
combination
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
P-6
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
Ad

Viewers also liked (20)

PDF
CLUSTERING WEB SEARCH RESULTS FOR EFFECTIVE ARABIC LANGUAGE BROWSING
PDF
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
PDF
Hybrid part of-speech tagger for non-vocalized arabic text
PDF
Building a vietnamese dialog mechanism for v dlg~tabl system
PDF
A MULTI-STREAM HMM APPROACH TO OFFLINE HANDWRITTEN ARABIC WORD RECOGNITION
PDF
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
PDF
Smart grammar a dynamic spoken language understanding grammar for inflective ...
PDF
Developemnt and evaluation of a web based question answering system for arabi...
PDF
An improved apriori algorithm for association rules
PDF
A comparative analysis of particle swarm optimization and k means algorithm f...
PDF
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
PDF
Evaluation of subjective answers using glsa enhanced with contextual synonymy
PDF
An exhaustive font and size invariant classification scheme for ocr of devana...
PDF
S ENTIMENT A NALYSIS F OR M ODERN S TANDARD A RABIC A ND C OLLOQUIAl
PDF
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
PDF
Conceptual framework for abstractive text summarization
PPT
khelchandra project on ai
PDF
3ªHistoria: Lánzate a la Piscina
DOCX
doc
 
XLS
New Microsoft Excel Worksheet
CLUSTERING WEB SEARCH RESULTS FOR EFFECTIVE ARABIC LANGUAGE BROWSING
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
Hybrid part of-speech tagger for non-vocalized arabic text
Building a vietnamese dialog mechanism for v dlg~tabl system
A MULTI-STREAM HMM APPROACH TO OFFLINE HANDWRITTEN ARABIC WORD RECOGNITION
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
Smart grammar a dynamic spoken language understanding grammar for inflective ...
Developemnt and evaluation of a web based question answering system for arabi...
An improved apriori algorithm for association rules
A comparative analysis of particle swarm optimization and k means algorithm f...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
Evaluation of subjective answers using glsa enhanced with contextual synonymy
An exhaustive font and size invariant classification scheme for ocr of devana...
S ENTIMENT A NALYSIS F OR M ODERN S TANDARD A RABIC A ND C OLLOQUIAl
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
Conceptual framework for abstractive text summarization
khelchandra project on ai
3ªHistoria: Lánzate a la Piscina
doc
 
New Microsoft Excel Worksheet
Ad

Similar to Chunking in manipuri using crf (20)

PDF
Phonetic distance based accent
PDF
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
PDF
Isolated word recognition using lpc & vector quantization
PDF
Isolated word recognition using lpc &amp; vector quantization
PDF
Myanmar Named Entity Recognition with Hidden Markov Model
PDF
E0502 01 2327
PDF
5215ijcseit01
PDF
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
PDF
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
PDF
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
PDF
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
PDF
A survey of named entity recognition in assamese and other indian languages
PDF
50120140503001
PDF
50120140503001
PDF
50120140503001
PDF
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
PDF
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
PDF
Arabic text categorization algorithm using vector evaluation method
PDF
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
PDF
ANN Based POS Tagging For Nepali Text
Phonetic distance based accent
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc &amp; vector quantization
Myanmar Named Entity Recognition with Hidden Markov Model
E0502 01 2327
5215ijcseit01
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
A survey of named entity recognition in assamese and other indian languages
50120140503001
50120140503001
50120140503001
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
Arabic text categorization algorithm using vector evaluation method
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
ANN Based POS Tagging For Nepali Text

Recently uploaded (20)

PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Machine learning based COVID-19 study performance prediction
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Encapsulation theory and applications.pdf
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
Big Data Technologies - Introduction.pptx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Programs and apps: productivity, graphics, security and other tools
Understanding_Digital_Forensics_Presentation.pptx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Spectral efficient network and resource selection model in 5G networks
Dropbox Q2 2025 Financial Results & Investor Presentation
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Machine learning based COVID-19 study performance prediction
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
NewMind AI Weekly Chronicles - August'25 Week I
Encapsulation theory and applications.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Per capita expenditure prediction using model stacking based on satellite ima...
Diabetes mellitus diagnosis method based random forest with bat algorithm
Advanced methodologies resolving dimensionality complications for autism neur...
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Digital-Transformation-Roadmap-for-Companies.pptx
Big Data Technologies - Introduction.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf

Chunking in manipuri using crf

  • 1. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014 10.5121/ijnlc.2014.3312 121 CHUNKING IN MANIPURI USING CRF Kishorjit Nongmeikapam1 ,Chiranjiv Chingangbam1 , Nepoleon Keisham1 , Biakchungnunga Varte1 , Sivaji Bandopadhyay2 1 Department of Computer Science & Engineering, Manipur Institute of Technology, Manipur University, Imphal, India 2 Department of Computer Science & Engineering, Jadavpur University, West Bengal, India ABSTRACT This paper deals about the chunking of the Manipuri language, which is very highly agglutinative in Nature. The system works in such a way that the Manipuri text is clean upto the gold standard. The text is processed for Part of Speech (POS) tagging using Conditional Random Field (CRF). The output file is treated as an input file for the CRF based Chunking system. The final output is a completely chunk tag Manipuri text. The system shows a recall of 71.30%, a precision of 77.36% and a F-measure of 74.21%. KEYWORDS CRF; POS; Chunk; Manipuri 1. INTRODUCTION The Manipuri Language has its origin in the north-eastern parts of India, widely spoken in the state Manipur, and some in the countries of Myanmar and Bangladesh. The Manipuri Language belongs to a high agglutinative class of language. The Conditional Random Fields (CRFs) serve as a powerful model for predicting structured labeling. Chunking is the process of identifying and labeling the simple phrases (it may be a Noun Phrase or a Verb Phrase) from the tagged output, of which the utterance of words for a given phrase forms as a chunk for this language. A POS tagged sequence output might also form as a base input for the CRF-based chunking. We synthesized a full scale Manipuri chunked file as the output. The procedure that we follow is that the input file is passed onto a CRF based POS tagger, and then this output from the tagger serve as the input for the CRF based Chunking, which duly generates the output chunked file. 
The paper is arranged in such a way that the related works is listed in Section II. Section III describes the concept of Conditional Random Field (CRF) which is followed by the System design at IV. The experiment and evaluation is discussed at Section V and the conclusion is drawn at Section VI. 2. RELATED WORKS Until now, no works in the area of CRF based chunking has ever been performed on the Manipuri language. Most of the previous works for other languages on this area make use of two machine- learning approaches for sequence labeling, namely HMM in [1] and the second approach as the
  • 2. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014 122 sequence labeling problem as a sequence of a classification problem, one for each of the labels in the sequence. Apart from the above two approaches, the CRF based chunking utilizes and gives the best of the generative and classification models. It resembles the classical model, in a way that they can accommodate many statistically correlated features of the inputs. And consecutively, it resembles the generative model, they have the ability to trade-off decisions at different sequence positions, and consequently it obtains a globally optimal labeling. It is shown in [2] that CRFs are better than related classification models. Parsing by chunks is discussed in [3]. Dynamic programming for parsing and estimation of stochastic unication-based grammars is mentioned in [4] and other related works are found in [5]-[7]. And on the field of text chunking, [1] proposed a Conditional Random Field based approach. The works on chunking can be observe applying both rule based and the probabilistic or statistical methods. 3. CONCEPT OF CONDITION RANDOM FIELD The concept of Conditional Random Field [8] is developed in order to calculate the conditional probabilities of values on other designated input nodes of undirected graphical models. CRF encodes a conditional probability distribution with a given set of features. It is an unsupervised approach where the system learns by giving some training and can be used for testing other texts. The conditional probability of a state sequence X=(x1, x2,..xT) given an observation sequence Y=(y1, y2,..yT) is calculated as : P(Y|X) = 1 ZX exp(∑ t= 1 T ∑ k λk f k( yt-1 ,yt , X,t)) ---(1) where, fk( yt-1,yt, X, t) is a feature function whose weight λk is a learnt weight associated with fk and to be learned via training. The values of the feature functions may range between -∞ … +∞, but typically they are binary. 
ZX is the normalization factor: ∑∑∑ = = T t k kk y XZ 1 t1-t t))X,,y,y(fexp λ ---(2) which is calculated in order to make the probability of all state sequences sum to 1. This is calculated as in Hidden Markov Model (HMM) and can be obtained efficiently by dynamic programming. Since CRF defines the conditional probability P(Y|X), the appropriate objective for parameter learning is to maximize the conditional likelihood of the state sequence or training data. ∑= N 1i )x|P(ylog ii ---(3) where, {(xi , yi )} is the labeled training data. Gaussian prior on the λ’s is used to regularize the training (i.e., smoothing). If λ ~ N(0,ρ2 ), the objective function becomes, ∑∑ − = k i 2 2N 1i 2 )x|P(ylog ii ρ λ ---(4)
  • 3. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014 123 The objective function is concave, so the λ’s have a unique set of optimal values. 4. SYSTEM DESIGN The system works with the application of CRF in two layers. The first layer is meant for the POS tagging of the Manipuri text file using certain features as mention in [9]. In the second layer the output file of the CRF based POS tagging is used as an input file of the CRF based chunking. Fig.1 explains the System block diagram. The chunking tag is the I-O-B tagging. That is as follows: TABLE I. IOB TAGGING B-X Beginning of the chunk word X I-X Intermediate or non beginning chunk word X O Word outside of the chunk text The processing and running of the CRF is shown on Fig. 2. Figure 1. System Block diagram The input file for the first time is a training file which gives and output of a model file and in the second run the input file is a testing file. The output file of the CRF is a labeled file. TEXT FILE CRF BASED POS TAGGER CRF BASED MANIPURI CHUNKER CHUNKED MANIPURI FILE
Figure 2. CRF based POS tagging (documents collection → pre-processing → feature extraction → CRF model training on the training data and labeling of the test data → evaluation of results)

The working of the CRF is mainly based on feature selection. The features listed for POS tagging are as follows:

F = { W_{i-m}, ..., W_{i-1}, W_i, W_{i+1}, ..., W_{i+n}, SW_{i-m}, ..., SW_{i-1}, SW_i, SW_{i+1}, ..., SW_{i+n}, number of acceptable standard suffixes, number of acceptable standard prefixes, acceptable suffixes present in the word, acceptable prefixes present in the word, word length, word frequency, digit feature, symbol feature, RMWE }

The details of the set of features that have been applied for POS tagging in Manipuri are as follows:

1. Surrounding words as feature: The preceding word(s) and the successive word(s) are important in POS tagging because these words play an important role in determining the POS of the present word.

2. Surrounding stemmed words as feature: The stemming algorithm mentioned in [10] is used. The preceding and following stemmed words of a particular word can be used as features, because the preceding and following words influence the POS tagging of the present word.

3. Number of acceptable standard suffixes as feature: As mentioned in [10], Manipuri being an agglutinative language, the suffixes play an important role in determining the POS of a word. For every word the suffixes are identified during stemming, and the number of suffixes is used as a feature.

4. Number of acceptable standard prefixes as feature: Prefixes play an important role in the Manipuri language. Prefixes are identified during stemming and are used as a feature.

5. Acceptable suffixes present as feature: The 61 standard suffixes of Manipuri which have been identified are used as one feature. The maximum number of appended suffixes reported is ten.
To account for such cases, ten space-separated columns are created for every word, one for each suffix present in the word. A "0" notation is used in those columns when the word contains no acceptable suffix.

6. Acceptable prefixes present as feature: 11 prefixes have been manually identified in Manipuri, and the list of prefixes is used as one feature. For every word, if a prefix is present then a column is created mentioning the prefix; otherwise the "0" notation is used.

7. Length of the word: The length feature is set to 1 if the word is longer than 3 characters; otherwise it is set to 0. Very short words are generally pronouns and rarely proper nouns.

8. Word frequency: A frequency range for words in the training corpus is set: words with fewer than 100 occurrences are given the value 0, and words with 100 or more occurrences are given the value 1. This is considered one feature, since determiners, conjunctions and pronouns occur abundantly.
9. Digit feature: Quantity measurements, dates and monetary values are generally digits, so the digit feature is an important feature. A binary notation of '1' is used if the word contains a digit, else '0'.

10. Symbol feature: Symbols like $, %, etc. are meaningful in textual use, so this feature is set to 1 if a symbol is found in the token, otherwise 0. This helps to recognize the symbol and quantifier number tags.

11. Reduplicated Multiword Expression (RMWE): RMWEs are also considered as a feature, since Manipuri is rich in RMWEs. The work on RMWE used here is described in [11].

5. EXPERIMENT AND EVALUATION

The text document is cleaned for processing, with errors and grammatical mistakes minutely checked by an expert. For the POS tagging, the expert also marks each word with its POS using a tag set. The POS-marked texts are used for both training and testing. Once the text documents are tagged with POS, the same text with the POS and the previous features is used to run the CRF based chunking. In other words, the POS tags are used as additional features for the chunking. The C++ based CRF++ 0.53 package¹ is used in this work; it is readily available as open source for segmenting or labeling sequential data.

In total, a corpus of 30,000 words is used to train and test the system. This corpus is considered a gold standard, since an expert manually identified the POS and the chunk words. Fig. 3 shows a sample of the POS and chunk tags marked by the expert.

Figure 3. Sample of words with POS and IOB chunk tags (each Manipuri word is followed by a POS tag such as NN, JJ, NC, QT, VFC or SYM and a chunk tag B-X, I-X or O)

Of the 30,000 words, 20,000 are used for training and the remaining 10,000 for testing.
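The fixed-width column layout implied by features 3-7 and 9 (suffix slots padded with "0") can be sketched as below. The suffix strings, prefix handling, and the word itself are invented stand-ins, since the paper's 61-suffix and 11-prefix inventories are not reproduced here, and a real CRF++ training row would also carry the surrounding-word, stem, and remaining feature columns:

```python
def feature_row(word, suffixes, prefixes, max_suffixes=10):
    """Build one space-separated feature row for a word, in the spirit of the
    CRF++ training file described in the text (a sketch, not the exact format)."""
    # Feature 5: up to ten suffix columns, padded with "0" when absent.
    suffix_cols = (suffixes + ["0"] * max_suffixes)[:max_suffixes]
    # Feature 6: the prefix column, or "0" when no acceptable prefix is present.
    prefix_col = prefixes[0] if prefixes else "0"
    length_feat = "1" if len(word) > 3 else "0"                   # feature 7
    digit_feat = "1" if any(c.isdigit() for c in word) else "0"   # feature 9
    cols = [word] + suffix_cols + [prefix_col,
                                   str(len(suffixes)),            # feature 3
                                   str(len(prefixes)),            # feature 4
                                   length_feat, digit_feat]
    return " ".join(cols)

row = feature_row("wordXYZ", ["-sfx1", "-sfx2"], [])
print(row)
```

One such row per word, with a blank line between sentences, is the usual layout CRF++ expects; the last column of the training file would then hold the POS tag (first layer) or the IOB chunk tag (second layer).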
Evaluation is done with the parameters of recall, precision and F-score, defined as follows:

Recall, R = (No. of correct answers given by the system) / (No. of correct answers in the text)

Precision, P = (No. of correct answers given by the system) / (No. of answers given by the system)

¹ http://crfpp.sourceforge.net/
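These ratios, together with the F-score, reduce to simple arithmetic over three counts. The counts below are invented for illustration (chosen only to land near the paper's reported range) and do not come from the actual evaluation:

```python
def recall_precision_f(correct_by_system, answers_by_system, correct_in_text,
                       beta=1.0):
    """Recall, precision and F-score from raw counts, following the
    standard definitions used in the evaluation."""
    r = correct_by_system / correct_in_text
    p = correct_by_system / answers_by_system
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return r, p, f

# Hypothetical counts: 713 chunks correct, 922 chunks proposed, 1000 gold chunks.
r, p, f = recall_precision_f(713, 922, 1000)
print(round(r * 100, 2), round(p * 100, 2), round(f * 100, 2))
```

With beta = 1 the F-score weights precision and recall equally, which is the setting used in the results table.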
F-score, F = ((β² + 1) P R) / (β² P + R)

When β is one, precision and recall are given equal weight. Different combinations of the features are tried for the chunking of the Manipuri text document. Among the combinations, the best features are found to be as follows:

F = { W_{i-2}, W_{i-1}, W_i, W_{i+1}, SW_{i-1}, SW_i, SW_{i+1}, number of acceptable standard suffixes, number of acceptable standard prefixes, acceptable suffixes present in the word, acceptable prefixes present in the word, word length, word frequency, digit feature, symbol feature, reduplicated MWE, POS }

Table II shows the recall, precision and F-measure of the system.

TABLE II. BEST RESULT

Model   Recall   Precision   F-Score
CRF     71.30    77.36       74.21

6. CONCLUSIONS

So far, no chunking work on Manipuri has been reported, and this work can be a starting point for the future. Other algorithms for improving the score can also be worked on. The main handicap with this language is its highly agglutinative nature. The system shows a recall of 71.30%, a precision of 77.36% and an F-measure of 74.21%, which leaves a lot of room for improvement.

REFERENCES

[1] Fei Sha and Fernando Pereira, "Shallow Parsing with Conditional Random Fields", in the Proceedings of HLT-NAACL 2003.
[2] John Lafferty, Andrew McCallum and Fernando Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data".
[3] S. Abney, "Parsing by Chunks", in R. Berwick, S. Abney, and C. Tenny, editors, Principle-based Parsing, Kluwer Academic Publishers, 1991.
[4] S. Geman and M. Johnson, "Dynamic Programming for Parsing and Estimation of Stochastic Unification-based Grammars", in Proc. 40th ACL, 2002.
[5] A. Ratnaparkhi, "A Linear Observed Time Statistical Parser Based on Maximum Entropy Models", in C. Cardie and R. Weischedel, editors, EMNLP-2, ACL, 1997.
[6] E. F. T. K. Sang, "Memory-based Shallow Parsing", Journal of Machine Learning Research, 2:559-594, 2002.
[7] T. Zhang, F. Damerau, and D. Johnson, "Text Chunking Based on a Generalization of Winnow", Journal of Machine Learning Research, 2:615-637, 2002.
[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", in the Proceedings of the 18th ICML, Williamstown, MA, USA, 2001, pp. 282-289.
[9] Kishorjit, N. and Sivaji, B., "A Transliteration of CRF Based Manipuri POS Tagging", in the Proceedings of the 2nd International Conference on Communication, Computing and Security (ICCCS-2012), Elsevier Ltd, 2012.
[10] Kishorjit, N., Bishworjit, S., Romina, M., Mayekleima Chanu, Ng., Sivaji, B., "A Light Weight Manipuri Stemmer", in the Proceedings of the National Conference on Indian Language Computing (NCILC), Cochin, India, 2011.
[11] Kishorjit Nongmeikapam, Nonglenjaoba, L., Nirmal, Y., Sivaji Bandhyopadhyay, "Reduplicated MWE (RMWE) Helps in Improving the CRF Based Manipuri POS Tagger", International Journal of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 1, DOI: 10.5121/ijitcs.2012.2106, 2012, pp. 45-59.

Authors

Kishorjit Nongmeikapam is working as Assistant Professor in the Department of Computer Science and Engineering, MIT, Manipur University, India. He completed his BE from PSG College of Technology, Coimbatore, and his ME from Jadavpur University, Kolkata, India. He is presently doing research in the area of multiword expressions and their applications. He has so far published 30 papers and is presently handling a transliteration project funded by DST, Govt. of Manipur, India. He is the author of the book "See the C Programming Language".

Chiranjiv Chingangbam is presently a student of Manipur Institute of Technology, pursuing his B.E. in the Dept. of Computer Science and Engineering. His area of interest is NLP.

Nepoleon Keisham is presently a student of Manipur Institute of Technology, pursuing his B.E. in the Dept. of Computer Science and Engineering. His area of interest is NLP.

Biakchungnunga Varte is presently a student of Manipur Institute of Technology, pursuing his B.E. in the Dept. of Computer Science and Engineering. His area of interest is NLP.

Sivaji Bandyopadhyay has been working as a Professor since 2001 in the Computer Science and Engineering Department at Jadavpur University, Kolkata, India.
His research interests include machine translation, sentiment analysis, textual entailment, question answering systems and information retrieval, among others. He is currently supervising six national and international projects in various areas of language technology. He has published a large number of journal and conference papers.