International Journal of Electrical and Computer Engineering (IJECE)
Vol. 13, No. 5, October 2023, pp. 5632~5640
ISSN: 2088-8708, DOI: 10.11591/ijece.v13i5.pp5632-5640
Journal homepage: http://ijece.iaescore.com
Identification of monolingual and code-switch information from
English-Kannada code-switch data
Ramesh Chundi¹, Vishwanath R. Hulipalled², Jay Bharthish Simha³
¹School of Computer Science and Applications, REVA University, Bangalore, India
²School of Computing and Information Technology, REVA University, Bangalore, India
³CTO and RACE Labs, Abiba Systems, REVA University, Bangalore, India
Article Info
Article history: Received Jul 16, 2022; Revised Oct 10, 2022; Accepted Mar 9, 2023

ABSTRACT
Code-switching is a very common occurrence in social media communication, predominantly found in multilingual countries like India. Using more than one language in communication is known as code-switching or code-mixing. Important applications of code-switching include machine translation (MT), shallow parsing, dialog systems, and semantic parsing. Identifying code-switch and monolingual information is useful for better communication on online networking websites. In this paper, we apply a character-level n-gram approach to identify monolingual and code-switch information from English-Kannada social media data. We compare various machine learning techniques, namely naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR) and neural network (NN), on English-Kannada code-switch (EKCS) data. From the proposed approach, it is observed that the character-level n-gram approach provides a 1.8% to 4.1% improvement in accuracy and a 1.6% to 3.8% improvement in F1-score. We also observe that the SVC and NN techniques outperform the others in terms of accuracy (97.9%) and F1-score (98%) with character-level n-grams.
Keywords:
Character level n-gram
Code-switch text
English-Kannada
Machine learning techniques
Monolingual text
This is an open access article under the CC BY-SA license.
Corresponding Author:
Ramesh Chundi
Research Scholar, School of Computer Science and Applications, REVA University
Rukmini Knowledge Park, Kattigenahalli, Srinivasa Nagar, Yelahanka, Bangalore-560064, India
Email: chundiramesh@gmail.com
1. INTRODUCTION
Social networking websites like YouTube, Facebook and Twitter provide boundless online communication for internet users. Communication in social media can be either code-switched or monolingual. Using two or more languages in communication is known as code-mixing or code-switching [1]. Code-switch text appears frequently in social media because users are not required to follow any linguistic or grammatical rules [2]–[4]. Code-switching can also be used to decrease the social distance between people or to draw attention in social media [5].
There are three types of code-switch text in social media, namely word-level code-switch, intra-sentential code-switch and inter-sentential code-switch: i) word-level code-switch: users switch from one language to another within a word; ii) intra-sentential code-switch: users switch from one language to another within a sentence; and iii) inter-sentential code-switch: users switch from one language to another between sentences.
According to a report in the leading newspaper The Times of India (Nov 7, 2018), 52% of Indians are bilingual (they can read and write at least two languages) and 18% are multilingual (can read and write more than two languages) [6]. Kannada is one of the most popular and oldest Dravidian languages in south India.
Kannada has 56.9 million speakers and is the 8th most spoken language in India. Due to the impact of the English language, internet users merge English and Kannada (sentence level) or write Kannada in English script (word level) in social media communication through tweets, WhatsApp and Facebook.
Some examples illustrating monolingual and code-switch text are discussed below. [E1] and [E2] are examples of monolingual text: [E1] is pure Kannada monolingual text and [E2] is pure English monolingual text. In monolingual text, both the script and the source language belong to the same language.
− [E1] ಗೀರಿಶ ಕಾರ್ನಾಡ್ ಅವರು ಆತ್ಮ ಕ್ಕೆ ಶಾಂತಿ ಸಿಗಲಿ ಎಾಂದು ಹಾರೈಸುತ್ತ ೀನೆ
Translation: Let Girish Karnad's soul rest in peace.
− [E2] I am proud of you I'm fan of you from this minute.
− [E3] Sir ಆದಷ್ಟು ಈ ಸಾಂಗ್ ನ್ ರ್ನನ್ share ಮಾಡ್ತ ೀನಿ. Nice song
Translation: Sir, I will try my best to share the song. Nice song.
− [E4] Dekshakkagi tyaga madiddare jeevana
Translation: Sacrifice the life for the country.
There are various ways to write code-switch text on social networking websites. For example, [E3] shows one way of writing code-switch text (both English and Kannada are used in the same sentence). [E4] shows another way (source and script differ): the source language is Kannada and the script is English. Relevant literature on this topic has been produced by several researchers who experimented on language identification (identification of monolingual and code-switch information) in the recent past. Research on code-switching dates back to the 1970s, and several hypotheses have been proposed that motivate its study. Examples include the markedness model [3], diglossia [7], communication accommodation theory [8] and the conversational analysis model [9].
Ahmed et al. [10] proposed an efficient way for language identification using cumulative frequency addition (CFA) of n-grams. However, this approach requires more testing on large datasets to evaluate the performance of CFA. Rosner and Farrugia [11] discussed language identification in English-Maltese code-mixed data and achieved nearly 95% accuracy. However, numeric and punctuation entities were completely ignored in this approach; including these would help build a more accurate model. Solorio and Liu [12] predicted possible code-switch points in Spanish-English code-switch data. They trained various learning algorithms using transcriptions of code-switch speech. The average values for the code-switch sentences produced by the machine learning approach were close to those produced by humans. Further, the accuracy can be improved by including a multi-word expression recognition system. Piergallini et al. [13] performed word-level language identification and prediction of code-switch points in Swahili-English code-switch data. The proposed approach achieved high accuracy in language identification and moderate improvement in code-switch point prediction. This approach still needs to address the social analysis of code-switching behaviour, such as the association between linguistic accommodation and power, or code-switching and social solidarity.
Yirmibeşoğlu and Eryiğit [14] proposed a system to identify code-switching between Turkish and English using character-level n-grams and conditional random fields (CRFs), achieving a 95.6% micro-average F1-score. However, there is still scope for improvement in F1-score by increasing the corpus size. Barman et al. [15] presented a preliminary study on automatic language identification for Indian-language code-mixing in social media messages. They performed word-level classification on Bengali-Hindi-English code-mixed data and concluded that character n-gram features are useful for language identification in code-mixed data. Word-level code-mixing was completely ignored in this work. Veena et al. [16] developed a system for word-level language identification from Tamil-English and Malayalam-English code-mixed data. This approach is based on character-based embedding with context information and achieved 93% accuracy on Malayalam-English and 95% accuracy on Tamil-English code-mixed data. Further, more features can be used to improve the system performance.
Mave et al. [17] examined different code-switching metrics and found that the CRF model performed better, by a margin of 2-5 percentage points for Spanish-English and 3-5 percentage points for Hindi-English, in comparison with a deep learning model. This work can be extended to match code-switching behaviour across various domains, such as chat conversations, song lyrics, and movie scripts, and across various language pairs.
Gundapu and Mamidi [18] presented a study of different models for language identification in English-Telugu code-mixed data. The CRF model was found to perform best for word-level language identification, with a 0.91 F1-score. Mandal and Singh [19] tested multichannel neural networks on two different code-mixed language pairs, Bengali-English and Hindi-English. They attained 93.28% accuracy on Bengali-English code-mixed data and 93.32% accuracy on Hindi-English code-mixed data.
Singh et al. [20] built an automatic named entity recognition (NER) system for Hindi-English code-mixed data, and the proposed system outperformed existing baseline systems with a 33.18% F1-score. However, this work can be extended to build natural language processing (NLP)
models, such as entity-specific sentiment analysers or semantic role labelling, that make use of NER for code-mixed data. Das et al. [21] presented a supervised learning model for word-level language identification in Bengali-English code-mixed data. Two types of word encoding, character-based and phonetic, were used along with stacking and threshold techniques. The stacking method achieved 91.78% accuracy and the threshold method achieved 92.35%. Chaitanya et al. [22] identified different languages in Facebook code-mixed comments and compared two word embedding methods, continuous bag of words (CBOW) and skip-gram. Shekhar et al. [23] applied two different assessment models, statistical and neural-based learning models, to Hindi-English code-mixed data. The results illustrate that word embeddings can detect language separation by identifying the source language of a word and mapping it to its language label.
Very few researchers have focused on language identification in English-Kannada code-switch (EKCS) data. Lakshmi and Shambhavi [24] addressed the problem of word-level language identification for English-Kannada code-mixed data. They evaluated various supervised classifiers and found that a dictionary-based model is better at handling word-level code-mixing in English-Kannada data. However, the problem of identifying monolingual and code-switch information from EKCS data still needs to be addressed. James et al. [25] provided experimental evidence demonstrating that the accuracy of cloud-based multilingual systems (Google and Microsoft Azure) is low when identifying the Maori language. Their study shows that a BiLSTM with bilingual embeddings identifies Maori-English code-switching points with an accuracy of 87%. Hybrid models that combine deep learning techniques with hand-crafted rules based on the phonotactic differences between the languages can improve the performance of the proposed approach.
2. METHOD
2.1. Pre-processing and annotation
To implement the monolingual and code-switch information identification task, 10,396 EKCS comments were collected from YouTube.com; these comments are written in English script. There are two types of comments in the collected dataset: comments containing a combination of monolingual words (i.e., English words written in English script) and code-switch words (i.e., Kannada words written in English script), and comments consisting entirely of code-switch words. Pre-processing is a fundamental and important task in NLP, since the performance of any model depends on quality data. To generate quality data, noisy data such as unwanted symbols, special characters, and digits must be removed from the dataset, since such noise plays no significant role in tasks like language identification and part-of-speech (POS) tagging. Hence, we performed a pre-processing task on the EKCS dataset to remove noisy data. First, digits, special characters, and emojis were removed from the dataset, and the text was then converted to lower case for uniformity. After pre-processing, tokenization was carried out to split each comment into tokens. Finally, the EKCS corpus contains 123,249 tokens from 10,396 comments.
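The exact cleaning rules are not spelled out beyond the noise categories above, so the following is a minimal sketch of the pre-processing and tokenization step, assuming each comment is a plain string and that stripping every non-letter character is an acceptable approximation for removing digits, special characters, and emojis:

import re

def preprocess(comment):
    # Keep only ASCII letters and whitespace; this drops digits,
    # special characters, and emojis in one pass (an assumption)
    cleaned = re.sub(r'[^A-Za-z\s]', ' ', comment)
    # Convert to lower case for uniformity
    return cleaned.lower()

def tokenize(comment):
    # Split a cleaned comment into tokens on whitespace
    return comment.split()

comment = "Dekshakkagi tyaga madiddare jeevana 123!!"
print(tokenize(preprocess(comment)))
# ['dekshakkagi', 'tyaga', 'madiddare', 'jeevana']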
Once the tokens are generated, the next step is annotation, i.e., assigning each token to its relevant class. Annotation is required to train and test the performance of supervised machine learning techniques. The entire corpus is annotated with six classes: monolingual (MN), code-switch (CS), names (NE), mix of monolingual and code-switch (MIX), location (LC), and all remaining tokens as unknown (UN). Table 1 shows sample data from the EKCS corpus. Table 2 shows the statistics (number of tokens in each class) of the EKCS corpus. Since the motive of this work is to identify monolingual and code-switch information, the corpus has fewer tokens in the NE, LC and UN classes than in the code-switch and monolingual classes; the MIX class has the fewest tokens (448), since a mix of monolingual and code-switch within a word is a rare occurrence in social media text.
2.2. Proposed methodology
The proposed character-level n-gram approach for identifying monolingual and code-switch information from the EKCS corpus is discussed in this section. Four supervised learning approaches, naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR), and neural networks (NN), are implemented with two types of feature extraction: word-level term frequency-inverse document frequency (TF-IDF) and character-level n-grams. To implement the proposed method, the EKCS corpus is divided into two parts, a train dataset and a test dataset. An 80:20 split was performed using sklearn.model_selection.train_test_split on the EKCS corpus: 80% of the data is used to train the various models and 20% is used to test their performance. Figure 1 shows the proposed approach for language identification. In total, 24,650 tokens are used as input to test the performance of the proposed model, and the input dataset is pre-processed. Table 3 shows the statistics of the input dataset.
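A sketch of this split is shown below; tokens and labels stand for the annotated EKCS corpus, and the fixed random_state is an added assumption for reproducibility:

from sklearn.model_selection import train_test_split

# tokens: the 123,249 corpus tokens; labels: their class tags
# (CS, MN, NE, MIX, LC, UN)
X_train, X_test, y_train, y_test = train_test_split(
    tokens, labels, test_size=0.20, random_state=42)

print(len(X_train), len(X_test))  # 98,599 train tokens, 24,650 test tokens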
Table 1. Sample of EKCS corpus
Token Label
Thumballa CS
Super MN
Rama NE
Correctagi MIX
Karnataka LC
Congi UN
Table 2. Statistics of EKCS corpus
Class Value
CS 85,178
MN 26,326
NE 7,048
MIX 448
LC 1,360
UN 2,889
Total no. of tokens 123,249
Figure 1. Language identification
Table 3. Statistics of test data
Class Value
CS 16,983
MN 5,313
NE 1,350
MIX 96
LC 278
UN 630
In the next step, feature extraction is carried out on the input dataset. The process of transforming text data into features is called feature extraction. TF-IDF is a technique that provides numerical values for text data and also gives the significance of a specific word in the corpus. Calculating term frequency is the first step in the TF-IDF process, and this is done using (1).
TF = W / N (1)
where W is the frequency of a word in the document, and N is the total number of words in the document.
Once the term frequency is calculated, the next step is to calculate the inverse document frequency, using (2).
IDF = log(D / WD) (2)
where D is the total number of documents, and WD is the number of documents containing the word. Finally, the TF-IDF value is calculated using (3).
TF-IDF = TF × IDF (3)
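The following is a small worked example of (1)-(3), treating each comment as a document. Note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of these formulas, so their values differ slightly from this textbook form:

import math

docs = [["nice", "song", "super", "song"],
        ["share", "song"],
        ["super", "nice"]]

def tf(word, doc):
    # (1): occurrences of the word in the document / words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # (2): log(total documents / documents containing the word)
    wd = sum(1 for d in docs if word in d)
    return math.log(len(docs) / wd)

def tf_idf(word, doc, docs):
    # (3): TF * IDF
    return tf(word, doc) * idf(word, docs)

print(tf_idf("song", docs[0], docs))  # (2/4) * log(3/2) ≈ 0.2027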
A character-level n-gram is a sequence of n characters in a word, where n is the number of characters in the sequence. Algorithm 1 shows the process of generating character-level n-grams for each token. Algorithm 2 shows the feature extraction process using the char_ngrams() and byte-pair encoding (bpemb_en.encode()) functions.
Algorithm 1. Character level n-gram generation
Input: word, n /* word – input text; n – number of characters in the sequence */
Output: character-level n-grams

from nltk.util import ngrams

def char_ngrams(word, n):
    # Generate character n-grams with '_' padding on both sides
    cng = list(ngrams(word, n, pad_left=True, pad_right=True,
                      left_pad_symbol='_', right_pad_symbol='_'))
    # Trim the mostly padded combinations at each end
    if n > 2:
        rc = n - 2
        cng = cng[rc:-rc]
    # Join each tuple of characters into an n-gram string
    return [''.join(ngram) for ngram in cng]
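For example, with n = 3 the word is padded on both sides, trigrams are generated, and the fully padded combinations at each end are trimmed:

print(char_ngrams("song", 3))
# ['_so', 'son', 'ong', 'ng_']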
Algorithm 2. Extraction of word2features
Input: w /* input word */
Output: features of the word

from bpemb import BPEmb

bpemb_en = BPEmb(lang='en')  # pretrained English byte-pair encoding model

def word2features(w, n=3):
    # n=3 is assumed here; the paper does not state the n-gram order
    features = {'word': w}
    # Start with the word itself, then add its character n-grams
    tmp = list(features.values()) + char_ngrams(w, n)
    # Add character n-grams of each byte-pair-encoded subword
    for sw in bpemb_en.encode(w):
        if sw.startswith('▁'):
            sw = sw[1:]  # strip the BPE word-boundary marker
        tmp.extend(char_ngrams(sw, n))
    return tmp
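The paper does not state how these feature lists are vectorized before being fed to the classifiers, so the following is one plausible sketch: each token's features are joined into a space-separated string and CountVectorizer builds the feature matrix, with MultinomialNB standing in for the NB classifier (SVC, LR, and NN would slot in the same way):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# X_train/X_test are token lists and y_train/y_test their labels
# from the 80:20 split above
train_docs = [' '.join(word2features(w)) for w in X_train]
test_docs = [' '.join(word2features(w)) for w in X_test]

vec = CountVectorizer(tokenizer=str.split, token_pattern=None)
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_docs), y_train)
y_pred = clf.predict(vec.transform(test_docs))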
3. RESULTS AND DISCUSSION
The supervised classification methods NB, SVC, LR and NN are compared using two feature extraction techniques: word-level TF-IDF and character-level n-grams. The comparison is based on two measures, accuracy and F1-score. It is observed that the proposed character-level n-gram method is more efficient in terms of accuracy and F1-score than word-level TF-IDF.
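Both measures can be computed with scikit-learn as sketched below; weighted averaging across the six classes is assumed here, since the paper reports a single F1-score per classifier:

from sklearn.metrics import accuracy_score, f1_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred, average='weighted'))
# Per-class F1-scores, as reported in Tables 4 and 5
print(classification_report(y_test, y_pred))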
3.1. Word level TF-IDF
Table 4 compares the word-level TF-IDF F1-scores of the various classifiers. All four classifiers produce almost similar F1-scores for the code-switch class, with a variation of about 1%. LR produces a slightly lower F1-score (93%) than the other classifiers for the monolingual class. The NB classifier surprisingly produces 0% for the mixed class, as shown in Table 4, since the number of tokens in that class (96, as shown in Table 3) is small compared to the other classes. For the remaining classes as well, NB produces lower F1-scores than the other classifiers (Table 4), since NB works best with large datasets.
Table 4. F1-score (%) of word level TF-IDF
Class  NB  SVC  LR  NN
CS     96  97   96  97
MN     96  96   93  96
NE     76  87   79  87
MIX    0   36   8   36
LC     66  92   79  92
UN     62  89   77  89
3.2. Character level n-gram
Table 5 compares the character-level n-gram F1-scores of the various classifiers. All four classifiers produce almost similar F1-scores for the code-switch class. NB produces a slightly lower F1-score (96%) than the other classifiers for the monolingual class, as shown in Table 5. NB and LR performed better for the mixed class (47% and 50%, respectively) in comparison with word-level TF-IDF. Overall, character-level n-grams improve the F1-score for all classes in comparison with word-level TF-IDF.
Table 5. F1-score (%) of character level n-gram
Class  NB  SVC  LR  NN
CS     98  99   99  99
MN     96  98   98  98
NE     84  92   91  92
MIX    47  62   50  61
LC     91  96   95  94
UN     81  89   89  90
Table 6 compares the F1-scores of word-level TF-IDF and character-level n-grams for each classifier. SVC and NN performed best with character-level n-gram feature extraction (98%), followed by the LR classifier (97.9%) and NB (96.1%). Figure 2 shows the F1-score improvement for the various supervised classifiers. From Figure 2, it is evident that the LR classifier gains the most from character-level n-grams (an improvement of 3.8%), followed by SVC and NN with 1.7% and NB with 1.6%.
Table 6. F1-Score of word level TF-IDF and character level n-gram
Classifier Word level TF-IDF Character level N-Gram Improvement
Naïve Bayes 94.5 96.1 1.6
SVC 96.3 98.0 1.7
Logistic regression 94.1 97.9 3.8
Neural networks 96.3 98.0 1.7
Figure 2. F1-Score improvement
Table 7 compares the accuracy of word-level TF-IDF and character-level n-grams for each classifier. SVC and NN performed best with character-level n-gram feature extraction (97.9%), followed by the LR classifier (97.8%) and NB (96%). Figure 3 shows the accuracy improvement for the various supervised classifiers; it is evident that the LR classifier gains the most from character-level n-grams (an improvement of 4.1%), followed by NB (2.2%) and SVC and NN (1.8% each).
Table 7. Accuracy of word level TF-IDF and character level n-gram
Classifier Word Level TF-IDF Character Level N-Gram Improvement
Naïve Bayes 93.8 96.0 2.2
SVC 96.1 97.9 1.8
Logistic Regression 93.7 97.8 4.1
Neural Networks 96.1 97.9 1.8
Figure 3. Accuracy improvement
4. CONCLUSION
In this work, we addressed the identification of monolingual and code-switch information from English-Kannada code-switch data and conducted experiments with various supervised classifiers using two feature extraction techniques: word-level TF-IDF and character-level n-grams. From the proposed
approach, it is observed that character-level n-grams yield a considerable improvement in F1-score and accuracy over word-level TF-IDF for all classifiers. In terms of F1-score, LR improved by 3.8%, followed by SVC and NN with 1.7% and NB with 1.6%. In terms of accuracy, LR improved by 4.1%, followed by NB with 2.2% and SVC and NN with 1.8% each. Considering overall performance, SVC and NN performed best (98% F1-score and 97.9% accuracy), ahead of LR (97.9% F1-score and 97.8% accuracy) and NB (96.1% F1-score and 96% accuracy). In future, we plan to develop an automatic language identification system for English-Kannada code-switch data.
ACKNOWLEDGEMENTS
The authors want to thank the REVA University management for their support of this research
activity.
REFERENCES
[1] J. Lipski, “Code-switching and the problem of bilingual competence,” Aspects of bilingualism, vol. 250, 1978.
[2] M. Gysels, “French in urban Lubumbashi Swahili: Codeswitching, borrowing, or both?,” Journal of Multilingual and
Multicultural Development, vol. 13, pp. 41–55, Jan. 1992, doi: 10.1080/01434632.1992.9994482.
[3] E. McClure, “Duelling languages: grammatical structure in codeswitching,” Studies in Second Language Acquisition, vol. 17,
no. 1, pp. 117–118, Mar. 1995, doi: 10.1017/S027226310001408X.
[4] M. G. Moyer, “Bilingual speech: A typology of code-mixing,” Language in Society, vol. 31, no. 4, pp. 621–624, Oct. 2002,
doi: 10.1017/S004740450224405X.
[5] C. Myers-Scotton, Social motivations for codeswitching: Evidence from Africa. Oxford University Press, 1995.
[6] R. Nagarajan, “52% of India’s urban youth are now bilingual, 18% speak three languages,” The Times of India, 2018.
[7] J. J. Gumperz and J.-P. Blom, “Social meaning in linguistic structures: Code-switching in Norway,” Directions in sociolinguistics,
pp. 407–434, 1971.
[8] H. Giles and R. N. St. Clair, Language and social psychology. Edition of Language and Social Psychology, 1979.
[9] P. Auer, Ed., Code-switching in conversation. Routledge, 2013.
[10] B. Ahmed, S.-H. Cha, and C. Tappert, “Language identification from text using n-gram based cumulative frequency addition,”
Proceedings of Student/Faculty Research Day, CSIS, Pace University, vol. 12, 2004.
[11] M. Rosner and P.-J. Farrugia, “A tagging algorithm for mixed language identification in a noisy domain.” Eighth Annual
Conference of the International Speech Communication Association, pp. 190–193, 2007.
[12] T. Solorio and Y. Liu, “Learning to predict code-switching points,” Proceedings of the 2008 Conference on Empirical Methods in
Natural Language Processing, 2008, doi: 10.3115/1613715.1613841.
[13] M. Piergallini, R. Shirvani, G. S. Gautam, and M. Chouikha, “Word-level language identification and predicting codeswitching
points in Swahili-English language data,” in Proceedings of the Second Workshop on Computational Approaches to Code
Switching, 2016, pp. 21–29, doi: 10.18653/v1/W16-5803.
[14] Z. Yirmibeşoğlu and G. Eryiğit, “Detecting code-switching between Turkish-English language pair,” in Proceedings of the 2018
EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, 2018, pp. 110–115, doi: 10.18653/v1/W18-6115.
[15] U. Barman, A. Das, J. Wagner, and J. Foster, “Code mixing: a challenge for language identification in the language of social
media,” in Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014, pp. 13–23,
doi: 10.3115/v1/W14-3902.
[16] P. V. Veena, M. A. Kumar, and K. P. Soman, “An effective way of word-level language identification for code-mixed facebook
comments using word-embedding via character-embedding,” in 2017 International Conference on Advances in Computing,
Communications and Informatics (ICACCI), Sep. 2017, pp. 1552–1556, doi: 10.1109/ICACCI.2017.8126062.
[17] D. Mave, S. Maharjan, and T. Solorio, “Language identification and analysis of code-switched social media text,” in Proceedings
of the Third Workshop on Computational Approaches to Linguistic Code-Switching, 2018, pp. 51–61, doi: 10.18653/v1/W18-
3206.
[18] S. Gundapu and R. Mamidi, “Word level language identification in English Telugu code mixed data,” arXiv preprint
arXiv:2010.04482, 2018.
[19] S. Mandal and A. K. Singh, “Language identification in code-mixed data using multichannel neural networks and context
capture,” in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, 2018,
pp. 116–120, doi: 10.18653/v1/W18-6116.
[20] K. Singh, I. Sen, and P. Kumaraguru, “Language identification and named entity recognition in hinglish code mixed tweets,” in
Proceedings of ACL 2018, Student Research Workshop, 2018, pp. 52–58, doi: 10.18653/v1/P18-3008.
[21] S. D. Das, S. Mandal, and D. Das, “Language identification of Bengali-English code-mixed data using character and phonetic
based LSTM models,” in Proceedings of the 11th Forum for Information Retrieval Evaluation, Dec. 2019, pp. 60–64,
doi: 10.1145/3368567.3368578.
[22] I. Chaitanya, I. Madapakula, S. K. Gupta, and S. Thara, “Word level language identification in code-mixed data using word
embedding methods for Indian languages,” in 2018 International Conference on Advances in Computing, Communications and
Informatics (ICACCI), Sep. 2018, pp. 1137–1141, doi: 10.1109/ICACCI.2018.8554501.
[23] S. Shekhar, D. K. Sharma, and M. M. Sufyan Beg, “An effective cybernated word embedding system for analysis and language
identification in code-mixed social media text,” International Journal of Knowledge-based and Intelligent Engineering Systems,
vol. 23, no. 3, pp. 167–179, Oct. 2019, doi: 10.3233/KES-190409.
[24] B. S. S. Lakshmi and B. R. Shambhavi, “An automatic language identification system for code-mixed English-Kannada social
media text,” in 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable
Solution (CSITSS), Dec. 2017, pp. 1–5, doi: 10.1109/CSITSS.2017.8447784.
[25] J. James et al., “Language models for code-switch detection of te reo Māori and English in a low-resource setting,” in Findings of
the Association for Computational Linguistics: NAACL 2022, 2022, pp. 650–660, doi: 10.18653/v1/2022.findings-naacl.49.
BIOGRAPHIES OF AUTHORS
Ramesh Chundi received the B.Sc. degree in computer science and the MCA degree from Sri Venkateswara University, India, in 2004 and 2007, respectively. He is currently pursuing a Ph.D. degree in computer science and applications at REVA University, India. His research interests include natural language processing (NLP), artificial intelligence (AI), machine learning, deep learning, data analytics, and data mining. He can be contacted at email: chundiramesh@gmail.com.
Vishwanath R. Hulipalled is a Professor in the School of Computing and IT, REVA University, Bangalore, Karnataka, India. He completed his BE, ME and Ph.D. in computer science and engineering. His areas of interest include machine learning, natural language processing, data analytics and time series mining. He has more than 24 years of academic and research experience and has authored more than 50 research articles in reputed journals and conference proceedings. He can be contacted at email: vishwanth.rh@reva.edu.in.
Jay Bharthish Simha is the CTO of ABIBA Systems and Chief Mentor at RACE Labs, REVA University. He completed his BE (Mech), M.Tech (Mech), M.Phil. (CS) and Ph.D. (AI). His areas of interest include fuzzy logic, soft computing, machine learning, deep learning, and their applications. He has more than 20 years of industrial experience and 4 years of academic experience. He has authored or co-authored more than 50 journal/conference publications. He can be contacted at email: jay.b.simha@reva.edu.in and jay.b.simha@abibasystems.com.

More Related Content

PDF
A Survey Of Current Datasets For Code-Switching Research
PPTX
Code switching and mixing in internet Chatting
PPTX
C.s & c.m
PPTX
MixedLanguageProcessingTutorialEMNLP2019.pptx
PDF
Social Motivation for Code Switching in Sujatha Novels in Tamil
PDF
Kurdish-English Code-Switching in Facebook
PPTX
Codeswitching by KRISHNAKSHI THAKURIA
PDF
A case study of code switching in trilingual conversations
A Survey Of Current Datasets For Code-Switching Research
Code switching and mixing in internet Chatting
C.s & c.m
MixedLanguageProcessingTutorialEMNLP2019.pptx
Social Motivation for Code Switching in Sujatha Novels in Tamil
Kurdish-English Code-Switching in Facebook
Codeswitching by KRISHNAKSHI THAKURIA
A case study of code switching in trilingual conversations

Similar to Identification of monolingual and code-switch information from English-Kannada code-switch data (20)

PDF
A case study of code switching in trilingual conversations
PDF
Code Mixing computationally bahut challenging hai
PDF
Grammatical Theory and Bilingual Codeswitching 1st Edition Jeff Macswan
PPTX
Exploring Code-Switching Variation in Bangla-English Messenger Chats: A Case ...
PPTX
Cmc assignment, adilla's group, sec2
PDF
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
PPT
Code Switching: a paper by Krishna Bista
PDF
A LINGUISTIC ANALYSIS OF TAGALOG-ENGLISH CODE SWITCHING IN OPM LOVE SONGS
PDF
55 2004-bilingualism
PDF
Advantages Of Code Switching
PPTX
Code switching
PDF
An Exploration Of Perceptions And Intentions Of Code-Switching Among Bilingua...
PPTX
Presentation on Code Switching
PPTX
CODE SWITCHING & Mixing.pptx
PPTX
Code switcing (1)
PPTX
Code Switching and Code Mixing
PPTX
Mixture varieties
PPTX
Proposal-Defense-Group. 3.pptx
PDF
Integration of Phonotactic Features for Language Identification on Code-Switc...
PDF
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
A case study of code switching in trilingual conversations
Code Mixing computationally bahut challenging hai
Grammatical Theory and Bilingual Codeswitching 1st Edition Jeff Macswan
Exploring Code-Switching Variation in Bangla-English Messenger Chats: A Case ...
Cmc assignment, adilla's group, sec2
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
Code Switching: a paper by Krishna Bista
A LINGUISTIC ANALYSIS OF TAGALOG-ENGLISH CODE SWITCHING IN OPM LOVE SONGS
55 2004-bilingualism
Advantages Of Code Switching
Code switching
An Exploration Of Perceptions And Intentions Of Code-Switching Among Bilingua...
Presentation on Code Switching
CODE SWITCHING & Mixing.pptx
Code switcing (1)
Code Switching and Code Mixing
Mixture varieties
Proposal-Defense-Group. 3.pptx
Integration of Phonotactic Features for Language Identification on Code-Switc...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...
Ad

Recently uploaded (20)

PPTX
Software Engineering and software moduleing
PPTX
Module 8- Technological and Communication Skills.pptx
PDF
August 2025 - Top 10 Read Articles in Network Security & Its Applications
PDF
SMART SIGNAL TIMING FOR URBAN INTERSECTIONS USING REAL-TIME VEHICLE DETECTI...
PPTX
introduction to high performance computing
PPTX
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
PDF
737-MAX_SRG.pdf student reference guides
PPTX
communication and presentation skills 01
PDF
Accra-Kumasi Expressway - Prefeasibility Report Volume 1 of 7.11.2018.pdf
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PDF
August -2025_Top10 Read_Articles_ijait.pdf
PPTX
Management Information system : MIS-e-Business Systems.pptx
PPTX
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
PPTX
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
PPT
Total quality management ppt for engineering students
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PDF
ChapteR012372321DFGDSFGDFGDFSGDFGDFGDFGSDFGDFGFD
PPTX
"Array and Linked List in Data Structures with Types, Operations, Implementat...
Software Engineering and software moduleing
Module 8- Technological and Communication Skills.pptx
August 2025 - Top 10 Read Articles in Network Security & Its Applications
SMART SIGNAL TIMING FOR URBAN INTERSECTIONS USING REAL-TIME VEHICLE DETECTI...
introduction to high performance computing
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
737-MAX_SRG.pdf student reference guides
communication and presentation skills 01
Accra-Kumasi Expressway - Prefeasibility Report Volume 1 of 7.11.2018.pdf
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
August -2025_Top10 Read_Articles_ijait.pdf
Management Information system : MIS-e-Business Systems.pptx
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
Total quality management ppt for engineering students
Exploratory_Data_Analysis_Fundamentals.pdf
ChapteR012372321DFGDSFGDFGDFSGDFGDFGDFGSDFGDFGFD
"Array and Linked List in Data Structures with Types, Operations, Implementat...

Identification of monolingual and code-switch information from English-Kannada code-switch data

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 13, No. 5, October 2023, pp. 5632~5640 ISSN: 2088-8708, DOI: 10.11591/ijece.v13i5.pp5632-5640  5632 Journal homepage: http://ijece.iaescore.com Identification of monolingual and code-switch information from English-Kannada code-switch data Ramesh Chundi1 , Vishwanath R. Hulipalled2 , Jay Bharthish Simha3 1 School of Computer Science and Applications, REVA University, Bangalore, India 2 School of Computing and Information Technology, REVA University, Bangalore, India 3 CTO and RACE Labs, Abiba Systems, REVA University, Bangalore, India Article Info ABSTRACT Article history: Received Jul 16, 2022 Revised Oct 10, 2022 Accepted Mar 9, 2023 Code-switching is a very common occurrence in social media communication, predominantly found in multilingual countries like India. Using more than one language in communication is known as code- switching or code-mixing. Some of the important applications of code- switch are machine translation (MT), shallow parsing, dialog systems, and semantic parsing. Identifying code-switch and monolingual information is useful for better communication in online networking websites. In this paper, we performed a character level n-gram approach to identify monolingual and code-switch information from English-Kannada social media data. We paralleled various machine learning techniques such as naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR) and neural network (NN) on English-Kannada code-switch (EKCS) data. From the proposed approach, it is observed that the character level n-gram approach provides 1.8% to 4.1% of improvement in terms of Accuracy and 1.6% to 3.8% of improvement in F1-score. Also observed that SVC and NN techniques are outperformed in terms of accuracy (97.9%) and F1-score (98%) with character level n-gram. Keywords: Character level n-gram Code-switch text English-Kannada Machine learning techniques Monolingual text This is an open access article under the CC BY-SA license. Corresponding Author: Ramesh Chundi School of Computer Science and Applications, Research Scholar, REVA University Rukmini Knowledge Park, Kattigenahalli, Srinivasa Nagar, Yelahanka, Bangalore-560064, India Email: chundiramesh@gmail.com 1. INTRODUCTION Social networking websites like YouTube, Facebook and Twitter are providing online communication without boundaries for internet users. Communication in social media can be either in code-switch or monolingual. Referring two or more languages in communication is known as code-mixing or code-switching [1]. Code-switch text appears frequently in social media text since the users are not mandatory to follow any linguistic or grammatical rules [2]–[4]. Also code-switch can be used to decrease the social distance between people or to draw attention in social media [5]. There are three types of code-switch text in social media, namely word level code-switch, intra-sentential code-switch and inter-sentential code-switch: i) word level code-switch: user’s switch from one language to another language within a word; ii) intra-sentential code-switch: user’s switch from one language to another language within the sentence; and iii) inter-sentential code-switch: user’s switch from one language to another language in between the sentences. 
According to the leading newspaper the times of India report (Nov 7th , 2018), 52% of Indians are bilingual (at least they can read and write two languages) and 18% are multilingual (can read and write more than two languages) [6]. Kannada is one of the popular and the oldest Dravidian language in south India.
  • 2. Int J Elec & Comp Eng ISSN: 2088-8708  Identification of monolingual and code-switch information from English-Kannada … (Ramesh Chundi) 5633 56.9 million speakers use Kannada and it is the 8th most spoken language in India. Due to the impact of English language, internet users are merging English and Kannada (sentence level) or Kannada writing in English (word level) in social media communication through Tweets, WhatsApp and Facebook. Some of the examples to illustrate monolingual and code-switch text are discussed below. [E1] and [E2] are the examples for monolingual text, [E1] is a pure Kannada monolingual text and [E2] is a pure English monolingual text. In monolingual text, both script and source languages belong to the same language. − [E1] ಗೀರಿಶ ಕಾರ್ನಾಡ್ ಅವರು ಆತ್ಮ ಕ್ಕೆ ಶಾಂತಿ ಸಿಗಲಿ ಎಾಂದು ಹಾರೈಸುತ್ ತ ೀನೆ Translation: Let Girish Karnad soul rest in peace. − [E2] I am proud of you I’m fan of you from this minute. − [E3] Sir ಆದಷ್ಟು ಈ ಸಾಂಗ್ ನ್ ರ್ನನ್ share ಮಾಡ್ ತ ೀನಿ. Nice song Translation: sir I will try my best to share the song. Nice song − [E4] Dekshakkagi tyaga madiddare jeevana Translation: Sacrifice the life for country. There are various ways to write the code-switch text in social networking websites. For example, [E3] refers to one way of writing the code-switch text (English and Kannada both the languages are used in the same sentence). [E4] is another way of writing the code-switch text (source and script both are different), source language is Kannada and scripting language is English. The relevant literature on this topic has been carried out by several researchers who experimented on language identification (identification of monolingual and code-switch information) in the recent past. Research on code-switch started from the year 1970, and few hypotheses are proposed. This leads to the motivations behind the study of code-switching. Some of the examples includes the markedness model [3], diglossia [7], communication accommodation theory [8] and conversational analysis model [9]. Ahmed et al. [10] proposed an efficient way for language identification using a cumulative frequency addition (CFA) of N-grams. However, in this approach more testing is required on large datasets to evaluate the performance of CFA. Rosner and Farrugia [11] discussed language identification in English- Maltese code-mixed data and achieved nearly 95% of the accuracy. However, numeric and punctuation entities were completely ignored in this approach. If it includs these, which will help to build more accurate model. Solorio and Liu [12] predicted the possible code-switch points in Spanish-English code-switch data. They trained various learning algorithms using transcription of code-switch speech. The average values for the code-switch sentences produced by machine learning approach were near to the values of those produced by the humans. Further, the accuracy can be improved by including a multi-word expression recognition system. Piergallini et al. [13] performed word-level language identification and prediction of code-switch point from Swahili-English code-switch data. The proposed approach achieved high accuracy in language identification and moderate improvement in code-switch point prediction. This approach needs to focus on social analysis of code-switch behaviour like the association between linguistic accommodation and power or code-switch and social solidarity. 
Yirmibeşoğlu and Eryiğit [14] proposed a system to identify the code-switching between Turkish- English by using character level n-gram and conditional random fields (CRFs) have achieved 95.6% of micro-average F1-score. However, still there is a scope for improvement in F1-score by increasing the corpus size. Barman et al. [15] presented a preliminary study on instinctive language identification with Indian language code-mixing from social media messages. They performed word-level classification on Bengali-Hindi-English code-mixing data and concluded that character n-gram features are useful for language identification in code mixed data. Word-level code-mixing completely ignored in this work. Veena et al. [16] developed a system for word-level language identification from Tamil-English and Malayalam-English code-mixed data. This approach is executed based on character-based embedding with context information and achieved 93% of accuracy on Malayalam-English and 95% of accuracy on Tamil-English code-mixed data. Further, more features can be used to improve the system performance. Mave et al. [17] examined different code-switching metrics and found that CRF model performed better with the boundary of 2-5 percentage for Spanish-English and 3-5 percentage for Hindi-English in comparison with deep learning model. This work can be extended to match the code-switching manners from various domains like chat conversations, song lyrics, and movie scripts across various language pairs. Gundapu and Mamidi [18] presented a study on different models for language identification in English-Telugu code-mixed data. It is found that CRF model outperformed for word-level language identification with 0.91 F1-score. Mandal and Singh [19] tested a multichannel neural networks on two different code-mixed languages such as Bengali-English and Hindi-English. They attained 93.28% of accuracy on Bengali-English code-mixed data and 93.32% of accuracy on Hindi-English code-mixed data. Singh et al. [20] build an automatic named entity recognition (NER) system for Hindi-English code-mixed data and the proposed system outperformed with 33.18% of F1-score in comparison with existing baseline systems. However, this work can be extended to build natural language processing (NLP)
  • 3.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 5, October 2023: 5632-5640 5634 models like entity-specific sentiment analysers or semantic role labelling which use of NER for code-mixed data. Das et al. [21] presented a supervised learning model for word-level language identification in Bengali- English code-mixed data. Two types of word encoding methods such as character and phonetic are used along with stacking and threshold techniques. The stacking method achieved 91.78% of accuracy and Threshold method achieved 92.35%. Chaitanya et al. [22] tried to identify different languages from Facebook code-mixed comments and compared two different word embedding methods such as continuous bag of word (CBOW) and skip-gram. Shekhar et al. [23] applied two different assessment models such as statistical and neural-based learning models on Hindi-English code-mixed data. The outcome of proposed model illustrates that the word embedding is capable to spot the language parting by identifying source of the word and similarly mapping to its language label. Very few researchers are focused for language identification in English-Kannada code-switch (EKCS) data. Lakshmi and Shambhavi [24] addressed the problem of word-level language identification for English-Kannada code-mixed data. Performed various supervised classifiers and found that the dictionary based model is better to handle word-level code-mixed in English-Kannada data. However, identifying monolingual and code-switch information from EKCS data problem needs to be addressed. James et al. [25] provided an investigational evidence to demonstrat that the accuracy of cloud-based multilingual systems (Google and Microsoft Azure) is low when identifying Maori language. The proposed study shows that, the BiLSTM with bilingual embeddings to identify Maori-English code-switching points with an accuracy of 87%. Hybrid models using hand-crafted rules based on the phonotactic variances between the deep learning techinques and languages can improve the performance of the proposed approach. 2. METHOD 2.1. Pre-processing and annotation To implement monolingual and code-switch information identification task, 10,396 EKCS comments are collected from YouTube.com and these comments are written using English script. There are two types of comments in the collected dataset, comments that are having the combination of monolingual words (i.e., English words written in English script) and code-switch words (i.e., Kannada words written in English script) and the comments that are completely with code-switch words. Pre-processing is a fundamental and important task in NLP, since the performance of any model depends on quality data. To generate quality data, it is required to remove the noisy data from the dataset such as unwanted symbols, special characters, and digits since this noisy data will not play any significance role in some of the tasks like language identification and part of speech (POS) tagging. Hence, we performed pre-processing task on EKCS dataset to remove noisy data. Frist, removed digits, special characters, and emojiees from the dataset and then converted into lower case to bring the uniformity in the text. After the pre-processing, tokenization task is carried out to split each comment into tokens. Finally, 123,249 tokens are presented in the EKCS corpus out of 10,396 comments. Once the tokens are generated, next step is annotation i.e., assign each token to its relevant class. 
Annotation is required to train and test the performance of supervised techniques in machine learning. The entire corpus is annotated with six classes such as monolingual (MN), code-switch (CS), names (NE), mix of monolingual and code-switch as MIX, location (LC), and remaining all other tokens as unknown (UN). Table 1 shows the sample data of EKCS corpus. Table 2 shows the statistics (number of tokens in each class) of EKCS corpus. Motive of this work is to identify monolingual and code-switch information, so the corpus has a smaller number of tokens in NE, LC and UN classes in comparison with code-switch and monolingual classes and particularly number of tokens in MIX class is the least (448) since the mix of monolingual and code-switch is a rare occurrence in social media text. 2.2. Proposed methodology The proposed approach called character level n-gram is discussed in this section to identify monolingual and code-switch information from EKCS corpus. Four types of supervised learning approaches such as naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR), and neural networks (NN) are implemented with two types of feature extraction such as word-level term frequency-inverse document frequency (TF-IDF) and character level n-gram. To implement proposed method, the EKCS corpus needs to be divided into two parts such as train dataset and test dataset. Performed 80:20 split by using sklearn.model_selection.tain_test_split on the EKCS corpus. Here, 80% of data is used to train the various models and 20% of the data is used to test the performance of the models. Figure 1 shows the proposed approach for language identification. Totally 24,650 tokens are used as input to test the performance of proposed model and the input dataset is pre-processed. Table 3 shows the statistics of input dataset.
  • 4. Int J Elec & Comp Eng ISSN: 2088-8708  Identification of monolingual and code-switch information from English-Kannada … (Ramesh Chundi) 5635 Table 1. Sample of EKCS corpus Token Label Thumballa CS Super MN Rama NE Correctagi MIX Karnataka LC Congi UN Table 2. Statistics of EKCS corpus Class Value CS 85,178 MN 26,326 NE 7,048 MIX 448 LC 1,360 UN 2,889 Total No of Tokens 123, 249 Figure 1. Language identification Table 3. Statistics of test data Class Value CS 16,983 MN 5,313 NE 1,350 MIX 96 LC 278 UN 630 In next step, feature extraction from the input dataset is carried out. The process of transforming the text data into features is called feature extraction. The TF-IDF is a technique which provides numerical values for text data and also gives the significance of a specific words in the corpus. Calculating term frequency is the first step in TF_IDF process and this can be done by using the (1).
  • 5.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 5, October 2023: 5632-5640 5636 𝑇𝐹 = 𝑊 𝑁 (1) where, W is the frequency of a word in the document, and N is the total no of words in the document. Once the term frequency is calculated, next step is to calculate the inverse document frequency and this can be done by using the (2). I𝐷𝐹 = log(𝐷 𝑊𝐷 ⁄ ) (2) where, D is the total no of documents, and WD is the no of documents containing the word. Finally, we can calculate TF-IDF value by using the (3). 𝑇𝐹 − 𝐼𝐷𝐹 = 𝑇𝐹 ∗ 𝐼𝐷𝐹 (3) A character level n-gram is an order of n characters in a word and n is the number of characters in the sequence. Algorithm 1 shows the process of generating character level n-grams for each token. Algorithm 2 shows the process of features extraction by using char_ngrams( ) and byte-pair encoding (bpemb_en.encode()) functions. Algorithm 1. Character level n-gram generation Input: word, n/*word–input text and n–Number of characters in the sequence */ Output: Character Level N-gram /*Generating Character N-grams */ char_ngrams(word, n) { cng=[]/* is empty list */ /* Left and Right padding */ cng.append(list(ngrams(word, n, pad_left=True, pad_right=True, left_pad_symbol=’_’,right_pad_symbol=’_’))) /* Remove duplicate combinations */ if(n>2) { rc=n-2 cng=[ng_list [rc:-rc] for cng in cng] } ng_tupl=[ngram for cng in cng for ngram in cng] form_string=‘ ’ for i in range (0,n) form_string+=('%s') ng_tupl=[form_string % ngram_tuple for ngram_tuple in ng_tupl] return ng_tupl } Algorithm 2. Extraction of Word2features Input: w/* input word */ Output: Features of the word /* Features Extraction */ word2features(w) { features={'word': word} tmp=list(features.values())+char_ngrams(w) for sw in bpemb_en.encode(w) if sw.startswith('▁') sw=sw[1:] tmp.extend(char_ngrams(sw)) return tmp }
  • 6. Int J Elec & Comp Eng ISSN: 2088-8708  Identification of monolingual and code-switch information from English-Kannada … (Ramesh Chundi) 5637 3. RESULTS AND DISCUSSION Various supervised classification methods such as NB, SVC, LR and NN are compared with two feature extractions such as word level TF-IDF and character level n-gram. The comparison is made based on two parameters such as accuracy and F1-score. It is observied that, the proposed method character level n-gram is more efficient interms of accuracy and F1-Score in comparison with word level TF-IDF. 3.1. Word level TF-IDF Table 4 shows the comparison of word level TF-IDF F1-score for various classifiers. It is observed that, all four classifiers are producing almost similar F1-score value with variation of about 1% for code-switch class. LR produces slightly less F1-score i.e., 93% in comparison with other classifiers for monolingual class. Surprisingly NB classifier produces 0% as shown in Table 4 for mixed class, since the number of tokens is less about 96 as shown in Table 3 in comparison with other classes. Further, rest of the classes, NB produces less F1-score in comparison with other classifiers as per Table 4 since NB works well for large datasets. Table 4. F1-score of word level TF-IDF Class F1-Score NB SVC LR NN CS 96 97 96 97 MN 96 96 93 96 NE 76 87 79 87 MIX 00 36 08 36 LC 66 92 79 92 UN 62 89 77 89 3.2. Character level n-gram Table 5 shows the comparison of character level n-gram F1-score for various classifiers. It is observed that, all four classifiers are producing almost similar F1-score for code-switch class. NB produces slightly less F1-score (96%) value in comparison with other classifiers for monolingual class as shown in Table 5. NB and LR performed better i.e., 47 and 42% respectively for mixed class in comparison with word level TF-IDF. If we considered overall, there is an improvement in F1-score with character level n-gram for all the classes in comparison with word level TF-IDF. Table 5. F1-Score of character level n-gram Class F1-Score NB SVC LR NN CS 98 99 99 99 MN 96 98 98 98 NE 84 92 91 92 MIX 47 62 50 61 LC 91 96 95 94 UN 81 89 89 90 Table 6 shows the F1-score comparison between word level TF-IDF and character level n-gram for each classifier. SVC and NN performed better with character level n-gram feature extraction (98%) in comparison with word level TF-IDF followed by LR classifier (97.9%) and NB (96.1%). Figure 2 shows the improvement of F1-score for various supervised classifiers. From the Figure 2, it is evident that a good amount of improvement in LR classifier with character level n-gram (an improvement of 3.8%) followed by SVC and NN with 1.7% and NB with 1.6%. Table 6. F1-Score of word level TF-IDF and character level n-gram Classifier Word level TF-IDF Character level N-Gram Improvement Naïve Bayes 94.5 96.1 1.6 SVC 96.3 98.0 1.7 Logistic regression 94.1 97.9 3.8 Neural networks 96.3 98.0 1.7
Figure 2. F1-score improvement

Table 7 compares the accuracy of word level TF-IDF and character level n-gram for each classifier. SVC and NN perform best with character level n-gram features (97.9%), followed by LR (97.8%) and NB (96%). Figure 3 shows the accuracy improvement for the various supervised classifiers: the largest gain is for LR with character level n-gram (an improvement of 4.1%), followed by NB with 2.2% and SVC and NN with 1.8% each.

Table 7. Accuracy (%) of word level TF-IDF and character level n-gram
Classifier           Word level TF-IDF   Character level n-gram   Improvement
Naïve Bayes          93.8                96.0                     2.2
SVC                  96.1                97.9                     1.8
Logistic regression  93.7                97.8                     4.1
Neural networks      96.1                97.9                     1.8

Figure 3. Accuracy improvement

4. CONCLUSION
In this work, we identified monolingual and code-switch information in English-Kannada code-switch data, conducting an experiment with various supervised classifiers using two feature extraction techniques: word level TF-IDF and character level n-gram. From the proposed
approach, it is observed that character level n-gram gives a considerable improvement in F1-score and accuracy over word level TF-IDF for all classifiers. In terms of F1-score, LR improves by 3.8%, followed by SVC and NN with 1.7% each and NB with 1.6%. In terms of accuracy, LR improves by 4.1%, followed by NB with 2.2% and SVC and NN with 1.8% each. Considering overall performance, SVC and NN perform best (98% F1-score and 97.9% accuracy), followed by LR (97.9% F1-score and 97.8% accuracy) and NB (96.1% F1-score and 96% accuracy). In future work, we plan to develop an automatic language identification system for English-Kannada code-switch data.

ACKNOWLEDGEMENTS
The authors want to thank the REVA University management for their support of this research activity.

REFERENCES
[1] J. Lipski, "Code-switching and the problem of bilingual competence," Aspects of Bilingualism, vol. 250, 1978.
[2] M. Gysels, "French in urban Lubumbashi Swahili: Codeswitching, borrowing, or both?," Journal of Multilingual and Multicultural Development, vol. 13, pp. 41–55, Jan. 1992, doi: 10.1080/01434632.1992.9994482.
[3] E. McClure, "Duelling languages: grammatical structure in codeswitching," Studies in Second Language Acquisition, vol. 17, no. 1, pp. 117–118, Mar. 1995, doi: 10.1017/S027226310001408X.
[4] M. G. Moyer, "Bilingual speech: A typology of code-mixing," Language in Society, vol. 31, no. 4, pp. 621–624, Oct. 2002, doi: 10.1017/S004740450224405X.
[5] C. Myers-Scotton, Social Motivations for Codeswitching: Evidence from Africa. Oxford University Press, 1995.
[6] R. Nagarajan, "52% of India's urban youth are now bilingual, 18% speak three languages," The Times of India, 2018.
[7] J. J. Gumperz and J.-P. Blom, "Social meaning in linguistic structures: Code-switching in Norway," Directions in Sociolinguistics, pp. 407–434, 1971.
[8] H. Giles and R. N. St. Clair, Language and Social Psychology, 1979.
[9] P. Auer, Ed., Code-Switching in Conversation. Routledge, 2013.
[10] B. Ahmed, S.-H. Cha, and C. Tappert, "Language identification from text using n-gram based cumulative frequency addition," Proceedings of Student/Faculty Research Day, CSIS, Pace University, vol. 12, 2004.
[11] M. Rosner and P.-J. Farrugia, "A tagging algorithm for mixed language identification in a noisy domain," Eighth Annual Conference of the International Speech Communication Association, pp. 190–193, 2007.
[12] T. Solorio and Y. Liu, "Learning to predict code-switching points," Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2008, doi: 10.3115/1613715.1613841.
[13] M. Piergallini, R. Shirvani, G. S. Gautam, and M. Chouikha, "Word-level language identification and predicting codeswitching points in Swahili-English language data," in Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016, pp. 21–29, doi: 10.18653/v1/W16-5803.
[14] Z. Yirmibeşoğlu and G. Eryiğit, "Detecting code-switching between Turkish-English language pair," in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, 2018, pp. 110–115, doi: 10.18653/v1/W18-6115.
[15] U. Barman, A. Das, J. Wagner, and J. Foster, "Code mixing: a challenge for language identification in the language of social media," in Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014, pp. 13–23, doi: 10.3115/v1/W14-3902.
[16] P. V. Veena, M. A. Kumar, and K. P. Soman, "An effective way of word-level language identification for code-mixed facebook comments using word-embedding via character-embedding," in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sep. 2017, pp. 1552–1556, doi: 10.1109/ICACCI.2017.8126062.
[17] D. Mave, S. Maharjan, and T. Solorio, "Language identification and analysis of code-switched social media text," in Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, 2018, pp. 51–61, doi: 10.18653/v1/W18-3206.
[18] S. Gundapu and R. Mamidi, "Word level language identification in English Telugu code mixed data," arXiv preprint arXiv:2010.04482, 2020.
[19] S. Mandal and A. K. Singh, "Language identification in code-mixed data using multichannel neural networks and context capture," in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, 2018, pp. 116–120, doi: 10.18653/v1/W18-6116.
[20] K. Singh, I. Sen, and P. Kumaraguru, "Language identification and named entity recognition in Hinglish code mixed tweets," in Proceedings of ACL 2018, Student Research Workshop, 2018, pp. 52–58, doi: 10.18653/v1/P18-3008.
[21] S. D. Das, S. Mandal, and D. Das, "Language identification of Bengali-English code-mixed data using character and phonetic based LSTM models," in Proceedings of the 11th Forum for Information Retrieval Evaluation, Dec. 2019, pp. 60–64, doi: 10.1145/3368567.3368578.
[22] I. Chaitanya, I. Madapakula, S. K. Gupta, and S. Thara, "Word level language identification in code-mixed data using word embedding methods for Indian languages," in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sep. 2018, pp. 1137–1141, doi: 10.1109/ICACCI.2018.8554501.
[23] S. Shekhar, D. K. Sharma, and M. M. Sufyan Beg, "An effective cybernated word embedding system for analysis and language identification in code-mixed social media text," International Journal of Knowledge-based and Intelligent Engineering Systems, vol. 23, no. 3, pp. 167–179, Oct. 2019, doi: 10.3233/KES-190409.
[24] B. S. S. Lakshmi and B. R. Shambhavi, "An automatic language identification system for code-mixed English-Kannada social media text," in 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), Dec. 2017, pp. 1–5, doi: 10.1109/CSITSS.2017.8447784.
[25] J. James et al., "Language models for code-switch detection of te reo Māori and English in a low-resource setting," in Findings of the Association for Computational Linguistics: NAACL 2022, 2022, pp. 650–660, doi: 10.18653/v1/2022.findings-naacl.49.
BIOGRAPHIES OF AUTHORS

Ramesh Chundi received the B.Sc. degree in computer science and the MCA degree from Sri Venkateswara University, India, in 2004 and 2007, respectively. He is currently pursuing the Ph.D. degree in computer science and applications at REVA University, India. His research interests include natural language processing (NLP), artificial intelligence (AI), machine learning, deep learning, data analytics, and data mining. He can be contacted at email: chundiramesh@gmail.com.

Vishwanath R. Hulipalled is a Professor in the School of Computing and IT, REVA University, Bangalore, Karnataka, India. He completed his BE, ME, and Ph.D. in computer science and engineering. His areas of interest include machine learning, natural language processing, data analytics, and time series mining. He has more than 24 years of academic and research experience and has authored more than 50 research articles in reputed journals and conference proceedings. He can be contacted at email: vishwanth.rh@reva.edu.in.

Jay Bharthish Simha is the CTO of ABIBA Systems and Chief Mentor at RACE Labs, REVA University. He completed his BE (Mech), M.Tech (Mech), M.Phil. (CS), and Ph.D. (AI). His areas of interest include fuzzy logic, soft computing, machine learning, deep learning, and their applications. He has more than 20 years of industrial experience and 4 years of academic experience, and has authored or co-authored more than 50 journal and conference publications. He can be contacted at email: jay.b.simha@reva.edu.in or jay.b.simha@abibasystems.com.