International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 6, December 2020, pp. 5909~5916
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i6.pp5909-5916
Journal homepage: http://ijece.iaescore.com/index.php/IJECE
Improving keyword extraction in multilingual texts
Bahareh Hashemzadeh1, Majid Abdolrazzagh-Nezhad2
1 Department of Computer and Information Technology, Faculty of Engineering, Torbat-E Heydariyeh University, Iran
2 Department of Computer Engineering, Faculty of Engineering, Bozorgmehr University of Qaenat, Iran
Article Info ABSTRACT
Article history:
Received Jun 18, 2019
Revised Apr 29, 2020
Accepted May 12, 2020
The accuracy of keyword extraction is a leading factor in information
retrieval systems and marketing. In the real world, text is produced in
a variety of languages, and the ability to extract keywords based on
information from different languages improves the accuracy of keyword
extraction. In this paper, the information available in all languages is applied
to improve a traditional keyword extraction algorithm on multilingual text.
The proposed keyword extraction procedure is an unsupervised algorithm designed
so that a word is selected as a keyword of a given text only if, in addition to
ranking highly in that language, it also ranks highly on the keyword criteria in
the other languages. To achieve this aim, the average TF-IDF of the candidate
words is calculated over the same and the other languages, and the words with
the highest average TF-IDF are chosen as the extracted keywords. The obtained
results indicate that the accuracies of the term frequency-inverse document
frequency (TF-IDF) algorithm, the graph-based algorithm, and the improved
proposed algorithm on the multilingual texts are 60.6, 80, and 91.3%, respectively.
Keywords:
Data retrieval
Graph-based algorithm
Keyword extraction
Language independent
Text mining
TF-IDF algorithm
Copyright © 2020 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Majid Abdolrazzagh-Nezhad,
Department of Computer Engineering,
Bozorgmehr University of Qaenat,
9761986844 Bozorgmehr University of Qaenat, Abolmafakher St, Qaen, South Khorasan, Iran.
Email: abdolrazzagh@buqaen.ac.ir
1. INTRODUCTION
Designing data retrieval systems for large databases is one of the research areas for the application of
information technology in the information business. There is an increasing demand for data retrieval
systems able to cross interlingual boundaries, as text data in different languages expands on the web [1-6].
Therefore, as the volume of electronic data in various languages grows, data retrieval independent of
the document language has gained importance. Manual extraction of effective keywords is a time-consuming
task. Recently, automatic keyword extraction, especially keyword extraction across different languages,
has become an interesting topic in text mining and data retrieval [7-9].
The fields of text mining and information retrieval, and especially their implementation on
large databases, are of particular importance, and the first step is to identify and extract keywords from
the texts in these fields. One of the main challenges in keyword extraction is that contextual information
comes in very diverse languages, while the available keyword extraction methods depend on the type of
language and its verbal structure. Multilingual keyword extraction is the research problem of this paper,
and the research objective is to design an unsupervised, language-independent extraction algorithm.
This is done by focusing on the repetition of keywords in each text, reinforced by their occurrence in
the other language versions, using the TF-IDF algorithm.
The rest of the paper is organized as follows: Section 2 reviews the state-of-the-art keyword
extraction methods. The keyword extraction problem is described in Section 3. The proposed language-
independent keyword extraction algorithm and its experimental results are discussed in Sections 4 and 5.
Finally, conclusions and recommendations are given in Section 6.
2. LITERATURE REVIEW
Several methods have been proposed so far for the identification and extraction of keywords, all of which
can be classified into two groups, supervised and unsupervised methods [10-12]. In the following,
we briefly discuss the proposed methods to identify the likely research challenges. The first group is
the supervised methods. In this group, there is a training data set from which a model is learned;
applying this model to a new document divides its phrases into two classes, key and
non-key phrases.
Supervised keyword extraction has also been cast as a classification or clustering problem whose model is
trained, for example, with a genetic algorithm [13, 14]. In the keyphrase extraction algorithm (KEA)
proposed by [15], which is built on a Bayes classifier, TF-IDF and the relative distance of the keyphrase
from the beginning of the text are the two inputs [16]. The authors of [16] also used a binary classifier
whose input features include references within the text. The decision tree of [17], the conditional random
field of [18], and a variant of KEA in [19] are other types of supervised keyword extraction.
The functionality of these methods is highly dependent on the training data, and the lack of such
high-quality data can cause an efficiency drop in the keyphrase extraction system. Moreover, the designed
model is specific to one domain and works only within the domain of usage.
Another approach to extracting keywords is through unsupervised methods. In these methods,
keyword extraction is treated as a ranking problem [20], the most important example of which is TF-IDF.
In this method, the number of repetitions of a word within a text is related to the number of its
repetitions in the other texts [21]. Graph-based methods are also among the unsupervised methods [22];
the works of [22-24] are examples of graph-based methods for keyword extraction (a minimal sketch of this
family is given after this paragraph). In unsupervised methods, there is no need for training data, and
the most important contextual phrases can be extracted using ranking strategies. Unlike the supervised
methods, the unsupervised methods are applicable to texts from any domain, independently of the domain of
usage. A qualitative analysis and comparison of the proposed methods reveals several advantages and
disadvantages, which can be summarized as follows.
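To make the graph-based family concrete, the following minimal sketch ranks the candidate words of a single text by their centrality in a co-occurrence graph, in the spirit of TextRank [23]; it is not the graphlet-pattern method of [27]. It assumes Python with the third-party networkx package and a pre-tokenized, stop-word-filtered word list.

import networkx as nx

def graph_keywords(words, window=4, top_k=7):
    # Build an undirected co-occurrence graph: words appearing within a small
    # sliding window of each other are connected by an edge.
    graph = nx.Graph()
    for i, word in enumerate(words):
        for neighbour in words[i + 1:i + window]:
            if neighbour != word:
                graph.add_edge(word, neighbour)
    if graph.number_of_nodes() == 0:
        return []
    scores = nx.pagerank(graph)          # centrality score of each candidate word
    return sorted(scores, key=scores.get, reverse=True)[:top_k]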
The first advantage of the unsupervised methods is that they can build models for any text type and domain.
Independence from training data, no efficiency drop when only poor-quality data are available, lower time
consumption for keyword extraction compared with the supervised methods, useful behaviour on high-volume
data, and high accuracy are among their other advantages. In contrast, low compatibility is their most
tangible shortcoming. The supervised methods, for their part, benefit from training data with regular,
high-quality categorization. However, their significant shortcomings are that they depend on the training
data, that the lack of high-quality data leads to an efficiency drop of the keyword extraction system, and
that the constructed model is built for one domain only and acts based on its domain of usage. Providing
training data is a time-consuming and laborious task, and evaluations made on the basis of frequency do not
scale to high-volume data. Since the problem of preparing training data does not arise in the unsupervised
setting [1, 3], we employ the unsupervised approach for the proposed algorithm.
Despite its simplicity, the TF-IDF algorithm is one of the most effective methods for keyword
extraction [16, 25], and its practical simplicity and efficiency have attracted considerable attention.
In the present study, an algorithm is proposed that improves keyword extraction over TF-IDF. The method
is based on TF-IDF, but it uses the information of each text in several languages to enhance the
TF-IDF-based keyword extraction. To implement this objective, we concentrated on the repetition of words
in the text and deleted the conjunctions, prepositions, and verbs. Further, we used the simultaneous
multilingual information available for a given text. This process is elucidated in detail in the following.
3. PROBLEM DESCRIPTION OF KEYWORD EXTRACTION
Data retrieval is used extensively in the everyday life of people. Enhancing efficiency and
improving performance is of great importance for the designers of data retrieval systems. As mentioned
previously, one way to increase the productivity of data retrieval systems is through the use of statistical
plans. In these plans, a frequency is computed for the candidate keywords, and the words with the highest
frequency are selected as keywords.
The aim of the present study was to propose an algorithm with the required features: it should be
unsupervised, language independent, simple, and fast enough to process a considerable amount of data.
By building the proposed algorithm on TF-IDF, which is a statistical, simple, language-independent and
unsupervised algorithm, by relying on a sequence of calls with Unicode data, and by designing an online
database, keywords can be extracted independently of the language in large databases.
By assessing the applications of data retrieval and text mining, we can see that the keywords within
a text play a significant role and facilitate the process in this field. For example, by finding
important words in a news item and detecting the sentences with more important words, we can include those
sentences in a summary and better comprehend the text. Since important words often appear in headings and
important sections, by recognizing the structure of a text and extracting keywords from these parts, we can
reach these words in a minimum of time. A feed (RSS) is used for reading news; it makes a news extract
available in a structured way in XML format, and the news is read and saved as Unicode.
For extracting keywords from news texts, we need websites with proper and authentic feed addresses.
Hence, we select feeds that provide appropriate information, and such feeds are selected for
every language. After the information is called from the feeds, it is saved in a database, as sketched
below. Some words appear with high frequency in all texts but carry no contextual value, such as pronouns,
adverbs, prepositions, conjunctions, and some frequent verbs; these elements are called public words.
By omitting the public words in statistical text mining, we have fewer calculations and higher efficiency.
Words can be weighted by their frequency in the document, and this weighting shows how important a word is
for a document; raw frequency alone, however, is not sufficient for data retrieval. The weight of a word in
a text increases with the number of its repetitions in that text, but it is controlled by the number of
words in the text. This method is an unsupervised one, which is applied to plain text. In contrast to
the supervised methods, it does not need a training dataset; providing appropriate training data is
time-consuming and not an easy task, and if the data lack the desired quality, they reduce the efficiency of
a supervised keyword extraction system.
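As a rough illustration of the feed-reading and storage workflow described above (Steps 1 and 2 of the algorithm in Section 4), the sketch below uses Python's built-in sqlite3 in place of SQL Server together with the third-party feedparser package; the feed URL and table layout are hypothetical and are not the authors' actual implementation.

import sqlite3
import feedparser

FEED_URL = "https://example.com/news/rss"   # hypothetical feed address

def save_feed(feed_url, db_path="news.db", language="en"):
    # Entries arrive as Unicode strings, so no per-language handling is needed here.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS news (language TEXT, title TEXT, body TEXT)")
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        conn.execute("INSERT INTO news VALUES (?, ?, ?)",
                     (language, entry.get("title", ""), entry.get("summary", "")))
    conn.commit()
    conn.close()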
4. THE PROPOSED ALGORITHM
Figure 1 presents the overall structure of the proposed algorithm in eight steps, whose details are
discussed as follows:
Figure 1. The overall structure of the proposed algorithm: (1) reading feed data; (2) saving data in
the database; (3) word extraction; (4) calculating TF-IDF in every text and every language; (5) saving
the obtained TF-IDF calculations; (6) improving keyword extraction based on TF-IDF; (7) depicting results;
(8) evaluating the accuracy rate
Step 1 (selecting feeds and retrieval): in order to gain access to various documents in different
languages, the appropriate feeds are selected. The data of each document, such as its title and
body, are retrieved in this step. Since the algorithm is language independent, the information is read in
Unicode format. Step 2 (saving document information in the large database): the read information is
stored in the database separately. Data are stored in Unicode format. This format covers most of
the languages. Step 3 (word extraction): all words are extracted from the text, and the repetitive public
words are omitted in this step. Every language has a list of repetitive words that should be deleted from
the extracted words; a minimal sketch is given below.
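The sketch below illustrates Step 3, assuming one ignore list of public words per language; the English list shown here is a hypothetical fragment.

import re

# Hypothetical fragment of the English ignore list; in practice one list per language.
IGNORE_EN = {"the", "a", "an", "and", "or", "of", "in", "on", "is", "are", "was"}

def extract_words(text, ignore=IGNORE_EN):
    # \w+ is Unicode-aware in Python 3, so the same tokenizer serves every language.
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in ignore and not w.isdigit()]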
Step 4 (TF-IDF calculations): the TF-IDF calculations are carried out in this step for every text and
language, and finally the TF-IDF computed for each text in the different languages is used to improve
the keywords. In this method, each word has a frequency-based weight in the document; this weighting shows
how important a word is for a particular document and is used frequently in data retrieval. The weight of
a word increases with its repetitions in a certain text, but it is controlled by the number of words in
the text, because if the text is lengthy some words are naturally repeated even though they carry no
particular significance. Term frequency is a criterion for how common and repetitive a word is in a text,
and is calculated as follows:
$TF(t,d) = 0.5 + 0.5 \times \dfrac{f(t,d)}{\max\{f(w,d) : w \in d\}}$ (1)
where f(t,d) in the numerator is the frequency of the term t in the selected text d, and the denominator
is the frequency of the most frequent word w in that text.
IDF (inverse document frequency) is a criterion that penalizes the most frequent and repetitive
words. It is obtained by dividing the total number of texts by the number of texts that contain
the word and taking the logarithm. For example, suppose there are 1000 texts in the whole database.
If a certain word occurs in all of them (like "is"), the ratio is 1000/1000 = 1 and its logarithm is 0,
that is, the word is a common word and takes a coefficient of 0. If, however, the word occurs in only
500 texts, the ratio is 2 and (with a base-2 logarithm) the coefficient is 1. The more texts a word is
repeated in, the lower its IDF weight. To avoid a zero denominator when a word occurs in no text, +1 is
added to the denominator, and IDF is calculated through the second formula:
$IDF(t,D) = \log\left(\dfrac{|D|}{1 + |\{d \in D : t \in d\}|}\right)$ (2)
where the numerator |D| is the total number of texts and the denominator is one plus the number of texts
containing the term t. The TF-IDF is then calculated through formula (3) as follows:
$TF\text{-}IDF(t,d,D) = TF(t,d) \times IDF(t,D)$ (3)
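A direct transcription of (1)-(3) is sketched below, assuming each document has already been reduced to a list of candidate words (Step 3); since (2) does not fix the base of the logarithm, the natural logarithm is used here.

import math
from collections import Counter

def tf(term, doc):
    # Eq. (1): augmented term frequency, scaled by the most frequent word in doc
    counts = Counter(doc)
    if not counts:
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, docs):
    # Eq. (2): log of the number of texts over (1 + number of texts containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + containing))

def tf_idf(term, doc, docs):
    # Eq. (3)
    return tf(term, doc) * idf(term, docs)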
Step 5 (saving calculations in the database): the TF-IDF calculations are saved in the database.
Step 6 (improving the extraction of the proposed TF-IDF): in the conventional TF-IDF, for a text in
a certain language, the words with the highest TF-IDF values are considered keywords of that text in that
language. In the proposed method, however, words are selected as keywords if their average TF-IDF is high
for that text in the same language and in the other languages. Therefore, instead of using the TF-IDF of
a text in a single language, its average TF-IDF over the available languages is used (a minimal sketch
follows). This simple but useful change improves the extraction of keywords significantly. In this paper,
the median and the maximum of TF-IDF over the languages were also tested, and their results outweighed
the conventional method; however, the method that uses the average TF-IDF has the highest accuracy.
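The sketch below illustrates Step 6: given a per-language table of TF-IDF scores for the same news item (Steps 4 and 5), the keywords are the words whose average score across the languages is highest. The alignment of words across the language versions is assumed to be available, and the score values in the usage comment are hypothetical.

def average_tfidf_keywords(scores_by_language, top_k=7):
    # scores_by_language: language -> {word: tf_idf} for the same news item
    totals, counts = {}, {}
    for lang_scores in scores_by_language.values():
        for word, score in lang_scores.items():
            totals[word] = totals.get(word, 0.0) + score
            counts[word] = counts.get(word, 0) + 1
    averages = {w: totals[w] / counts[w] for w in totals}
    return sorted(averages, key=averages.get, reverse=True)[:top_k]

# Hypothetical usage with two language versions of one text:
# average_tfidf_keywords({"en": {"quds": 47, "march": 50}, "fa": {"quds": 45, "march": 44}})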
Step 7 (depicting results): this step shows the keywords extracted by the improved TF-IDF
algorithm. Step 8 (evaluating the accuracy rate): in this step, the keyword extraction accuracy of
the algorithm is calculated through the following formula:
$\text{Accuracy rate} = \dfrac{\text{No. of correctly extracted keywords}}{\text{Total no. of words extracted as keywords}} \times 100$ (4)
where the correctly extracted keywords are the words common between the actual keywords and those
extracted by the algorithm, and the denominator is the total number of words extracted by the algorithm
as keywords.
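Eq. (4) as a short sketch, where the ground-truth keywords are those given by the website and the extracted set comes from the algorithm.

def accuracy_rate(extracted, ground_truth):
    # Eq. (4): correctly extracted keywords over all words extracted as keywords
    if not extracted:
        return 0.0
    correct = len(set(extracted) & set(ground_truth))
    return 100.0 * correct / len(extracted)

# e.g. accuracy_rate(["quds", "march", "america"], ["quds", "march"]) -> about 66.7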
The pseudo-code of the proposed keyword extraction algorithm is presented in Figure 2.
The algorithm is unsupervised and can be run on plain text; unlike supervised keyword
extraction algorithms, it needs no training data set. As noted, providing appropriate
training data is time consuming and difficult, and if the data are not of good quality, the
efficiency of supervised keyword extraction algorithms declines.
1. Begin
2. If data have ASCII format, convert them to Unicode format.
3. Read information from feeds with Unicode by Get RSS data function.
4. Store the information in a database.
5. Generate Ignore array based on prepositions, conjunctions, adverbs and verbs.
6. Read all words by GetWord function and save them in Key_word array.
7. Remove the words of Ignore array from Key_word array.
8. Calculate Eq. (3) for the Key_word array by running the TF-IDF algorithm.
9. Calculate the average TF-IDF for the Key_word array across the different languages.
10. Save words of the Key_word array as keywords if their average TF-IDF over the same and the other languages is high.
11. Calculate Eq. (4) for the identified keywords.
12. End
Figure 2. The pseudo-code of the proposed algorithm
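Putting the pieces together, the driver below follows the pseudo-code of Figure 2 at a high level. It is a sketch only: it reuses the hypothetical helpers from the earlier sketches (extract_words, tf_idf, average_tfidf_keywords, accuracy_rate), assumes the same news item is available at the same index in every language, and is not the authors' SQL Server implementation.

def run_pipeline(news_by_language, given_keywords, target=0):
    # news_by_language: language -> list of news bodies; the item at index `target`
    # is the text whose keywords are extracted (the same item in every language).
    scores_by_language = {}
    for lang, texts in news_by_language.items():
        docs = [extract_words(t) for t in texts]                               # Step 3
        doc = docs[target]
        scores_by_language[lang] = {w: tf_idf(w, doc, docs) for w in set(doc)} # Steps 4-5
    keywords = average_tfidf_keywords(scores_by_language)                      # Step 6
    return keywords, accuracy_rate(keywords, given_keywords)                   # Steps 7-8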
5. EXPERIMENTAL RESULTS
The proposed algorithm was programmed in SQL Server 2012 and Visual Studio 2013, and the
simulations were performed on an Intel Core i5 (64-bit, 2.50 GHz CPU) with 21 GB of RAM. The database used
for evaluating the efficiency and performance of the proposed keyword extraction algorithm was an
online dataset containing 200 news items collected from the BBC website in various languages, each
available in eight languages. The reason for using such a dataset was to provide up-to-date information
that is processed at the same time. The proposed method is assessed by counting the number of matches
between the keywords extracted by the proposed method and the given keywords.
5.1. The results of the proposed algorithm
The algorithm designed in this study is language independent and has a simple structure.
In contrast to language-dependent algorithms (such as [26]), which use Persian word roots for keyword
extraction, this algorithm is directly applicable to large databases in any language. In the improved
algorithm, the words with a high mean TF-IDF over all languages were selected as keywords, and the accuracy
of the algorithm, considering the text in the various languages, is improved. It is noteworthy that
non-keywords, including verbs and prepositions, are repeated considerably in a text, so all such
non-keywords were set aside at the very beginning. The proposed algorithm is applicable to all
multilingual websites; here the results are shown for the BBC News website only. The database used
comprises 200 news items collected from the BBC website in eight languages (a total of 1600 items). As can
be seen in Table 1, the words with relatively high TF-IDF (here, TF-IDF greater than 20) were considered,
while in the conventional TF-IDF algorithm, in every language, the words with the highest TF-IDF values in
that language are counted as keywords. As can be seen in Table 2, in Persian the word "America" is
mistakenly detected as a keyword (shown in bold in Table 2), while in English three words, "America",
"England", and "London" (in bold in Table 2), were mistakenly detected as keywords. In the other
languages, two or three words were also mistakenly identified as keywords.
Table 3 illustrates the results of the proposed algorithm for the selected text. As can be seen in
the table, the mean TF-IDF over the eight languages (the proposed algorithm) is calculated for each word
listed in Table 1, and seven keywords were selected. The keywords selected by this method apply to all
eight languages; for this text they are Quds, Zionist, America, demonstration, people, Palestine, and Iran,
of which America is mistakenly detected as a keyword for all languages. In contrast, as shown in Table 2,
in the conventional TF-IDF method the number of wrongly detected keywords differs per language and exceeds
one word for most languages. Comparing the accuracy of the mean TF-IDF algorithm (the proposed one) with
that of the conventional algorithm, the conventional algorithm (Table 2) correctly detected 6 of the 7
Persian words, 4 of the 7 English words, and 3 of the 7 Arabic words, with similar counts in the other
languages. In total, over the 8 languages and the 56 = 7 × 8 correct keywords, 39 were detected correctly,
so the accuracy of the conventional algorithm is 39/56 ≈ 0.7, while in the mean TF-IDF method 6 of the 7
words were detected correctly for all languages and the accuracy is 6/7 ≈ 0.86. A similar comparison
applies to the median and maximum methods.
Table 1. The TF-IDF values for the words most likely to be among the keywords of the selected text
Table 2. The results of the conventional TF-IDF algorithm; words in bold are mistakenly identified as keywords
Table 3. Results of the proposed improved TF-IDF algorithm; words in bold are mistakenly identified as keywords
Maximum method (selected keywords for any of the 8 languages) | Median method (selected keywords for all 8 languages) | Average method (selected keywords for all 8 languages) | Right keywords
50 March | 50 March | 47.00 Ghods | March
47 Ghods | 47 Ghods | 40.12 Zionist | Ghods
43 Zionist | 43 Zionist | 38.87 America | Global
43 America | 43 America | 36.87 March | Zionist
40 Palestine | 38 People | 29.25 People | Iran
29 Terrorist | 23.5 Palestine | 25.87 Palestine | People
38 People | 21 Iran | 22.00 Iran | Palestine
5.2. The comparison of the obtained results with the other related algorithms
To evaluate efficiency and performance, the accuracy rate of the proposed algorithm is
compared with that of the other methods. The algorithm was tested on 200 texts in eight languages,
for which a total of 1200 correct keywords were given. The accuracy rate of
the conventional TF-IDF algorithm, with 1014 correct words out of 1672 extracted keywords, is 60.6%,
while for the proposed algorithm with the mean TF-IDF, with 1164 correct words out of 1275 extracted
words, the rate is 91.3%. For the proposed algorithm with the median method, 1092 correct words out of
1456 give a rate of 75%, and for the maximum method, 1021 correct words out of 1531 give an accuracy rate
of 66.6%. The accuracy rate of the graph-based algorithm [27] on these data is 80%. Among these rates,
the mean method, with an accuracy of 91.3%, is the best. Table 4 summarizes the results on the BBC data.
This suggests that the proposed algorithm not only extracts keywords in a language-independent way but also
achieves considerably better results. Table 5 compares the algorithm with other related algorithms.
Table 4. Accuracy rates of the proposed and existing keyword extraction algorithms on the BBC data
Algorithm | Proposed TF-IDF (maximum) | Proposed TF-IDF (median) | Proposed TF-IDF (average) | Graph [27] | Conventional TF-IDF
Accuracy rate | 66.6% | 75% | 91.3% | 80% | 60.6%
Table 5. Comparing the proposed algorithm with other related algorithms
Algorithm | Accuracy
The proposed algorithm | 91.3%
Graph [27] | 80%
Kp [28] | 47.7%
MSF [29] | 60%
GATE [30] | 64.4%
Habibi [1] | 75%
Single-document [31] | 83.2%
6. CONCLUSION
Data retrieval is widely applied in everyday life, and increasing the efficiency and performance of
information retrieval systems is very important for their designers. Investigating the applications of
data retrieval and text mining shows that the keywords of a text are important and facilitate the
subsequent processing. For example, by finding the keywords in a news item, or the sentences containing
more keywords, we can summarize or comprehend the text more easily. To achieve this aim, an unsupervised
keyword extraction algorithm is proposed based on improving the TF-IDF algorithm for multilingual texts.
In the proposed algorithm, the average TF-IDF of the candidate words is calculated over the same and
the other languages, and the words with the highest average TF-IDF are chosen as the extracted keywords.
A database of 200 news items collected from the BBC website in various languages was used to evaluate
the efficiency of the proposed algorithm. The experimental results show that the selected keywords closely
match the keywords given by the website, which confirms the reliability of the algorithm. The overall
accuracy rate of the algorithm is 91.3%, which is higher than that of the state-of-the-art keyword
extraction algorithms. We would like to introduce three strategies as future work to improve the proposed
algorithm in terms of application, complexity, and time: finding complex (multi-word) keywords could be
added to the algorithm, real-time and online behaviour could be achieved by focusing on parallel
processing, and normalizing the feeds' addresses could be considered to facilitate access.
REFERENCES
[1] M. Habibi and A. Popescu-Belis, “Keyword extraction and clustering for document recommendation in
conversations,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 23, no. 4, pp. 746-759,
2015.
[2] M. Savić, et al., “A language-independent approach to the extraction of dependencies between source code
entities,” Information and Software Technology, vol. 56, no. 10, pp. 1268-1288, 2014.
[3] S. Siddiqi and A. Sharan, “Keyword and keyphrase extraction techniques: a literature review,” International
Journal of Computer Applications, vol. 109, no. 2, pp. 18-23, 2015.
[4] T. S. Chung, et al., “A survey of flash translation layer,” Journal of Systems Architecture, vol. 55, no. 5-6,
pp. 332-343, 2009.
[5] N. I. Abdulkhaleq, et al., “Improving the data recovery for short length LT codes,” International Journal of
Electrical & Computer Engineering, vol. 10, no. 2, pp. 1972-1979, 2020.
[6] N. N. Kulkarni and S. A. Jain, “Checking integrity of data and recovery in the cloud environment,” Indonesian
Journal of Electrical Engineering and Computer Science (IJEECS), vol. 13, no. 2, pp. 626-633, 2019.
[7] E. Cambria and B. White, “Jumping NLP curves: A review of natural language processing research,” IEEE
Computational intelligence magazine, vol. 9, no. 2, pp. 48-57, 2014.
[8] V. Jain and S. V. A. V. Prasad, “Ontology based information retrieval model in semantic web: a review,”
International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 8,
pp. 837-842, 2014.
[9] K. Kim, et al., “Language independent semantic kernels for short-text classification,” Expert Systems with
Applications, vol. 41, no. 2, pp. 735-743, 2014.
[10] D. Deshwal, et al., “Feature Extraction Methods in Language Identification: A Survey,” Wireless Personal
Communications, vol. 107, pp. 2071-2103, 2019.
[11] S. K. Bharti and K. S. Babu, “Automatic keyword extraction for text summarization: A survey,” arXiv preprint
arXiv:1704.03242, 2017.
[12] E. Ferrara, et al., “Web data extraction, applications and techniques: A survey,” Knowledge-based systems, vol. 70,
pp. 301-323, 2014.
[13] P. Turney, “Learning to extract keyphrases from text,” National Research Council Canada, 2002.
[14] S. S. Hong, et al., “The feature selection method based on genetic algorithm for efficient of text clustering and
text classification,” International Journal of Advances in Soft Computing and its Applications, vol. 7, no. 1,
pp. 22-40, 2015.
[15] E. Frank, et al., “Domain-specific keyphrase extraction,” in 16th International joint conference on artificial
intelligence (IJCAI 99), vol. 2, pp. 668-673, 1999.
[16] C. Caragea, et al., “Citation-enhanced keyphrase extraction from research papers: A supervised approach,”
in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1435-1446, 2014.
[17] G. Ercan and I. Cicekli, “Using lexical chains for keyword extraction,” Information Processing & Management,
vol. 43, no. 6, pp. 1705-1714, 2007.
[18] F. Fkih and M. N. Omri, “Complex terminology extraction model from unstructured web text based linguistic and
statistical knowledge,” International Journal of Information Retrieval Research, vol. 2, no. 3, pp. 1-18, 2013.
[19] G. Figueroa, et al., “RankUp: Enhancing graph-based keyphrase extraction methods with error-feedback
propagation,” Computer Speech & Language, vol. 47, pp. 112-131, 2018.
[20] S. Lahiri, et al., “Keyword and keyphrase extraction using centrality measures on collocation networks,” arXiv
preprint arXiv:1401.6571, 2014.
[21] P. Tonella, et al., “Using keyword extraction for web site clustering,” in Fifth IEEE International Workshop on
Web Site Evolution, 2003. Theme: Architecture. Proceedings, pp. 41-48, 2003.
[22] S. K. Biswas, et al., “A graph based keyword extraction model using collective node weight,” Expert Systems with
Applications, vol. 97, pp. 51-59, 2018.
[23] R. Mihalcea and P. Tarau, “Textrank: Bringing order into text,” in Proceedings of the 2004 conference on empirical
methods in natural language processing, pp. 404-411, 2004.
[24] S. Duari and V. Bhatnagar, “sCAKE: Semantic Connectivity Aware Keyword Extraction,” Journal of Information
Sciences, vol. 477, pp. 100-117, 2019.
[25] K. S. Hasan and V. Ng, “Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art,”
in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 365-373, 2010.
[26] R. Farhad, et al., “Improved Clustering Persian Text Based on Keyword Using Linguistic and Thesaurus
Knowledge,” Signal and Data Processing, vol. 13, no. 1, pp. 87-100, 2016.
[27] A. R. Nabhan and K. Shaalan, “Keyword identification using text graphlet patterns,” in International Conference
on Applications of Natural Language to Information Systems, pp. 152-161, 2016.
[28] C. B. Ali, et al., “A two-level keyphrase extraction approach,” in International Conference on Intelligent Text
Processing and Computational Linguistics, pp. 390-401, 2015.
[29] D. Y. Lee, et al., “A New Extraction Algorithm for Hierarchical Keyword Using Text Social Network,”
in Information Science and Applications (ICISA) 2016, pp. 903-912, 2016.
[30] P. Nesi, et al., “A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and
Documents,” in 21st International Conference on Distributed Multimedia Systems (DMS 2015), pp. 1-7, 2015.
[31] F. Rousseau and M. Vazirgiannis, “Main core retention on graph-of-words for single-document keyword
extraction,” in European Conference on Information Retrieval, pp. 382-393, 2015.
BIOGRAPHIES OF AUTHORS
Bahareh Hashemzadeh has been lecturing and researching in Electrical and Computer Engineering at
the University of Torbat Heydarieh since 2016. She received her M.Sc. in Information Science
from Birjand University in 2015. Her fields of interest are information technology, obfuscation, and
data mining.
Majid Abdolrazzagh-Nezhad has been lecturing and researching as an assistant professor at the Computer
Engineering Department of Bozorgmehr University of Qaenat since 2013 and has been head of the department
since 2016. He was dean of the faculty of computer science from 2013 to 2016 and has supervised master's
and PhD students of the Islamic Azad University of Birjand since 2016. He received his PhD in Computer
Science from the Faculty of Information Science and Technology, National University of Malaysia (UKM),
in 2013, his master's degree in Operations Research from the University of Sistan and Baluchestan in 2007,
and his bachelor's degree from the University of Birjand in 2004. His fields of interest are artificial
intelligence, optimization, data mining, scheduling, and uncertain systems. He is a Young Professionals
member of the Institute of Electrical and Electronics Engineers (IEEE) and a reviewer for journals such as
Information Sciences, Applied Soft Computing, Soft Computing, International Journal of Production Research,
IEEE Transactions on Industrial Electronics, and IEEE Transactions on Industrial Informatics.

More Related Content

PDF
06522405
PDF
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUE
PDF
Clustering Students of Computer in Terms of Level of Programming
PDF
Predicting students' performance using id3 and c4.5 classification algorithms
PDF
Data Analysis and Result Computation (DARC) Algorithm for Tertiary Institutions
PDF
Investigation of Attitudes Towards Computer Programming in Terms of Various V...
PDF
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
PDF
WEB-BASED DATA MINING TOOLS : PERFORMING FEEDBACK ANALYSIS AND ASSOCIATION RU...
06522405
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUE
Clustering Students of Computer in Terms of Level of Programming
Predicting students' performance using id3 and c4.5 classification algorithms
Data Analysis and Result Computation (DARC) Algorithm for Tertiary Institutions
Investigation of Attitudes Towards Computer Programming in Terms of Various V...
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
WEB-BASED DATA MINING TOOLS : PERFORMING FEEDBACK ANALYSIS AND ASSOCIATION RU...

What's hot (17)

PDF
IRJET- Multi-Document Summarization using Fuzzy and Hierarchical Approach
PDF
AN ENHANCED ELECTRONIC TRANSCRIPT SYSTEM (E-ETS)
PDF
Readiness measurement of IT implementation in Higher Education Institutions i...
PDF
IRJET - Student Pass Percentage Dedection using Ensemble Learninng
PDF
Automatic Query Expansion Using Word Embedding Based on Fuzzy Graph Connectiv...
PDF
F03403031040
PDF
Libyan Students' Academic Performance and Ranking in Nursing Informatics - Da...
PDF
PREDICTING ACADEMIC MAJOR OF STUDENTS USING BAYESIAN NETWORKS TO THE CASE OF ...
PDF
An Evaluation of Preprocessing Techniques for Text Classification
PDF
Data Mining Application in Advertisement Management of Higher Educational Ins...
PDF
TOBRUK UNIVERSITY GRADING SYSTEM FOR COLLEGE OF NURSING VERSION 2 IN TOBRUK, ...
PDF
L016136369
PDF
Hybrid Classifier for Sentiment Analysis using Effective Pipelining
PDF
Correlation based feature selection (cfs) technique to predict student perfro...
PDF
02 20274 improved ichi square...
PDF
Ijetcas14 368
PDF
C017510717
IRJET- Multi-Document Summarization using Fuzzy and Hierarchical Approach
AN ENHANCED ELECTRONIC TRANSCRIPT SYSTEM (E-ETS)
Readiness measurement of IT implementation in Higher Education Institutions i...
IRJET - Student Pass Percentage Dedection using Ensemble Learninng
Automatic Query Expansion Using Word Embedding Based on Fuzzy Graph Connectiv...
F03403031040
Libyan Students' Academic Performance and Ranking in Nursing Informatics - Da...
PREDICTING ACADEMIC MAJOR OF STUDENTS USING BAYESIAN NETWORKS TO THE CASE OF ...
An Evaluation of Preprocessing Techniques for Text Classification
Data Mining Application in Advertisement Management of Higher Educational Ins...
TOBRUK UNIVERSITY GRADING SYSTEM FOR COLLEGE OF NURSING VERSION 2 IN TOBRUK, ...
L016136369
Hybrid Classifier for Sentiment Analysis using Effective Pipelining
Correlation based feature selection (cfs) technique to predict student perfro...
02 20274 improved ichi square...
Ijetcas14 368
C017510717
Ad

Similar to Improving keyword extraction in multilingual texts (20)

PDF
Machine learning for text document classification-efficient classification ap...
PDF
2. an efficient approach for web query preprocessing edit sat
PDF
Building a recommendation system based on the job offers extracted from the w...
PDF
Mining Social Media Data for Understanding Drugs Usage
PDF
IRJET- Determining Document Relevance using Keyword Extraction
PDF
Performance Evaluation of Query Processing Techniques in Information Retrieval
PDF
H04564550
PDF
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
PDF
Using data mining methods knowledge discovery for text mining
PDF
Developing a framework for
PDF
Data mining for prediction of human
PDF
Development of Information Extraction for Data Analysis using NLP
PDF
Extraction and Retrieval of Web based Content in Web Engineering
PDF
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
PDF
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
PDF
An Efficient Approach for Keyword Selection ; Improving Accessibility of Web ...
PDF
An in-depth review on News Classification through NLP
PDF
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language
PDF
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Machine learning for text document classification-efficient classification ap...
2. an efficient approach for web query preprocessing edit sat
Building a recommendation system based on the job offers extracted from the w...
Mining Social Media Data for Understanding Drugs Usage
IRJET- Determining Document Relevance using Keyword Extraction
Performance Evaluation of Query Processing Techniques in Information Retrieval
H04564550
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Using data mining methods knowledge discovery for text mining
Developing a framework for
Data mining for prediction of human
Development of Information Extraction for Data Analysis using NLP
Extraction and Retrieval of Web based Content in Web Engineering
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
An Efficient Approach for Keyword Selection ; Improving Accessibility of Web ...
An in-depth review on News Classification through NLP
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...

Recently uploaded (20)

PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
composite construction of structures.pdf
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Well-logging-methods_new................
DOCX
573137875-Attendance-Management-System-original
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
PPT on Performance Review to get promotions
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Digital Logic Computer Design lecture notes
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
composite construction of structures.pdf
CH1 Production IntroductoryConcepts.pptx
Operating System & Kernel Study Guide-1 - converted.pdf
Lecture Notes Electrical Wiring System Components
Foundation to blockchain - A guide to Blockchain Tech
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Well-logging-methods_new................
573137875-Attendance-Management-System-original
UNIT 4 Total Quality Management .pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPT on Performance Review to get promotions
Internet of Things (IOT) - A guide to understanding
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Digital Logic Computer Design lecture notes
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx

Improving keyword extraction in multilingual texts

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 10, No. 6, December 2020, pp. 5909~5916 ISSN: 2088-8708, DOI: 10.11591/ijece.v10i6.pp5909-5916  5909 Journal homepage: http://ijece.iaescore.com/index.php/IJECE Improving keyword extraction in multilingual texts Bahareh Hashemzadeh1 , Majid Abdolrazzagh-Nezhad2 1 Department of Computer and Information Technology, Faculty of Engineering, Torbat-E Heydariyeh University, Iran 2 Department of Computer Engineering, Faculty of Engineering, Bozorgmehr University of Qaenat, Iran Article Info ABSTRACT Article history: Received Jun 18, 2019 Revised Apr 29, 2020 Accepted May 12, 2020 The accuracy of keyword extraction is a leading factor in information retrieval systems and marketing. In the real world, text is produced in a variety of languages, and the ability to extract keywords based on information from different languages improves the accuracy of keyword extraction. In this paper, the available information of all languages is applied to improve a traditional keyword extraction algorithm from a multilingual text. The proposed keywork extraction procedure is an unsupervise algorithm and designed based on selecting a word as a keyword of a given text, if in addition to that language holds a high rank based on the keywords criteria in other languages, as well. To achieve to this aim, the average TF-IDF of the candidate words were calculated for the same and the other languages. Then the words with the higher averages TF-IDF were chosen as the extracted keywords. The obtained results indicat that the algorithms’ accuracis of the multilingual texts in term frequency-inverse document frequency (TF-IDF) algorithm, graph-based algorithm, and the improved proposed algorithm are 80, 60.65, and 91.3%, respectively. Keywords: Data retrieval Graph-based algorithm Keyword extraction Language independent Text mining TF-IDF algorithm Copyright © 2020 Institute of Advanced Engineering and Science. All rights reserved. Corresponding Author: Majid Abdolrazzagh-Nezhad, Department of Computer Engineering, Bozorgmehr University of Qaenat, 9761986844 Bozorgmehr University of Qaenat, Abolmafakher St, Qaen, South Khorasan, Iran. Email: abdolrazzagh@buqaen.ac.ir 1. INTRODUCTION Designing data retrieval systems of large databases is one of the research areas for the application of information technology in the information business. We faced an increasing demand for types of data retrieval systems able to cross the interlingual boundaries, while text data expands in different languages and on the web [1-6]. Therefore, by developing the volume of electronic data in various languages, the data retrieval, independent of document languages, has gained importance. The extraction of effective keywords is a time-consuming and human-processing task. Recently, automatic keyword extraction, especially keyword extraction in different languages, introduced an interesting topic for text mining and data retrieval [7-9]. The fields of text mining and information retrieval and especially their implementation on the database is of particular importance. The first step is to identify and extract keywords from the texts in the fields. One of the main challenges to extract keywords is existing very diverse languages for contextual information and depending the available keyword extraction methods on the language’s type and its verbal structure. 
The multilingual keywords extraction is the current research problem and the research object is considered based on designing an unsupervised language-independent algorithm to the extraction. So, it is done by focusing on the property of repeating keywords in each text and their intensifying in other texts by utilizing the TF-IDF algorithm. The rest of the current paper is organized as follows: Section 2 reviews the state-of-the-art keywords extraction methods. The problem of keywords extraction descrids in Section 3. The proposed language
  • 2.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 10, No. 6, December 2020 : 5909 - 5916 5910 independent keywords extraction algorithm and its experimental results are discussed in Section 4 and Section 5. Finally, a conclusion and recommendations are described in section 6. 2. LITERATURE REVIEW Several methods were proposed so far for the identification and extraction of keywords, all of which could be classified into two groups of supervised and unsupervised methods [10-12]. In the following, we discuss shortly about the proposed methods to realize the probable research challenges. The first group is the supervised methods. In this group, there is a training data set, by learning of which a model is designed and by incorporating this model on new document the phrases will divided into two classes of key and non-key phrases. The supervised method of word extraction is considered as a clustering problem, which should be trained like a genetic algorithm [13, 14]. In Bayes linear algorithm, which is called a keyphrase extraction algorithm (KEA) and proposed by [15], TF-IDF and keyphrase relative distance from the beginning of the text are two algorithm inputs [16]. They also used a binary clustering algorithm that its input features include some references to the text. Decision tree of [17], conditional random field of [18], and a type of KEA in [19] are among other types of supervised word extraction. The functionality of this method is highly dependent on training data and lack of such high quality data could cause an efficiency drop in the system of keyphrase extraction. In this method, the designed model is specific to a domain and works based on the domain of usage. Another approach to extracting keywords is through unsupervised methods. In these methods, word extraction is dealt with as a ranking issue [20], the most important of which is the TF-IDF. In this method, the relation between the number of a word repetition within a text is calculated according to the number of its repletion in other texts [21]. Graph-based methods are also among the unsupervised methods [22]. The works of [22-24] are examples of graph-based methods for word extraction. In unsupervised methods, there is no need for training data and the most important contextual phrases could be extracted by using the ranking strategies. Unlike the supervised methods, the unsupervised methods are applicable for each text to any domain type independent of domain of usage. By the qualitative analysis and comparison of the proposed methods several advantages and disadvantages were found, which could be noted as follows. The first advantage of the unsupervised methods is their applications in constructing models of any text type and domain. No efficiency drops in case of existence of poor quality data, independently of training data, lower time consumption for keyword extraction, compared with the supervised methods, useful functionality for high-volume data, and high accuracy are among the advantages of the unsupervised methods. In contrast to these advantages, low compatibility is the most tangible shortcoming of these methods. As mentioned previously, there are some disadvantages/advantages of the supervised methods, among which we could refer to the existence of training data with the quality of regular data categorization. 
However, one of the significant shortcomings of this method is that it is dependent on the training data and lack high-quality data could lead to an efficiency drop of the keyword extraction system, the constructed model is for one domain only, and it acts based on the domain of usage. Providing training data is a time-consuming and laborious task. Moreover, evaluations which are made based on frequency are not applying for high-volume data. One of the challenges of such a method is that providing training data is time-consuming and if such data are not available, the algorithm faces problems and has low efficiency, but it is not the case in the unsupervised method [1, 3]. Hence, we employ this method for the proposed algorithm. Despite the simplicity, TF-IDF algorithm is one of the effective methods for keyword extraction [16, 25]. The practical simplicity and efficiency of this algorithm has attracted a considerable attention. A logarithm is proposed for word extraction in the present study to improve TF-IDF. This method is based on TF-IDF, but uses the information of each text in several languages to enhance keyword extraction based on TF-IDF. To implement such an objective, we concentrated on the repetition of words in the context and deleted the conjunctions, prepositions, and verbs. Further, we used simultaneous multilingual information for a certain text, to improve its usage. This process is elucidated in details in the following. 3. PROBLEM DISCRIPTION OF KEYWORDS EXTRACTION Data retrieval is used extensively in the everyday life of people. Enhancing efficiency and improving performance is of great importance for the designers of data retrieval systems. As mentioned previously, one way to increase the productivity of data retrieval systems is through the use of statistical plans. In these plans, a frequency is set of keywords, based on which words with the highest frequency are selected as keywords.
  • 3. Int J Elec & Comp Eng ISSN: 2088-8708  Improving keyword extraction in multilingual texts (Bahareh Hashemzadeh) 5911 In aim of the present study was to propose an algorithm, which has the required features, including non-supervisory, language-independent, simplicity, and high speed for processing considerable amount of data. By using the proposed algorithm along with the TF-IDF, which is a statistical, simple, language independent and non-supervisory algorithm, by relying on a sequence of calls with Unicode format, and by designing an online database keyword could be extracted independently of language in large databases. By assessing the applications of data retrieval and text mining, we could realize that existing keywords within a text play a significant role and facilitate the process in this field. For example, by finding important words in the news and by detecting sentences with more important words, we could extract that sentence in the abstract and better comprehend the text. Since important words are often in headings and important sections, by realizing the structure of a text and by extracting keywords out of these parts, we could get access to these words with a minimum of time. Feed or RSS is used for reading news, which make a news extract available in a structured way by XML format. News reading and saving template are Unicode. For extracting keywords of news texts, we need websites with proper and authentic feed addresses. Hence, we select those feeds, which provide appropriate information. These feeds, however, are selected for every language. After calling information from feeds, they will be saved in a database. Some words are available with high frequency in all texts with no contextual value, like pronouns, adverbs, prepositions, conjunctions, and some frequent verbs. These elements are called public words. By omitting the public words in statistical text mining, we have less calculations and higher efficiency. Words take an equal weight based on their frequency in the document. Actually, this weighting system shows how much a word is important for a document. This fact has no functionality in data retrieval. The weight of a word in a text increases by the number of repetitions in that text, but it is controlled by the number of words in the text. This method is an unsupervised one, which is applied to a simple text. In contrast to the supervised methods, this method does not need the training dataset, in that proving an appropriate training data is a time-consuming and not an easy task and in case the data lack the desired quality, they reduce the efficiency of the supervised keyword extraction system. 4. THE PROPOSED ALGORITHM Figure 1 presents the oerall structure of the proposed algorithm in seven steps, which its detail is discussed as follows: Figure 1. The overall structure of the proposed algorithm Step 1 (selecting feeds and retrieval): in order to gain access to various documents of different languages, we tried to select the appropriate feeds. Data retrieval of each document, like title or body is carried out in this step. Since our algorithm is language independent, information is read by the unicode format. Step 2 (saving document information in the large database): the read information is stored in the database, separately. Data are stored in the Unicode format. 
This format covers most of the languages.
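As an illustrative sketch of steps 1 and 2 only, the following Python code reads a feed and stores each entry's title and summary in a local table; the use of the feedparser and sqlite3 libraries, the table layout, and the feed address are our own assumptions rather than details given by the proposed system:

import sqlite3
import feedparser  # third-party RSS/Atom parser; returns Unicode strings

def save_feed(feed_url, db_path="news.db"):
    # Step 1: read the feed entries.
    feed = feedparser.parse(feed_url)
    # Step 2: store the title and body (summary) of every entry in the database.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS news (title TEXT, body TEXT)")
    con.executemany(
        "INSERT INTO news (title, body) VALUES (?, ?)",
        [(e.get("title", ""), e.get("summary", "")) for e in feed.entries],
    )
    con.commit()
    con.close()

# Example call with a hypothetical feed address:
# save_feed("https://example.com/news/rss.xml")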
Step 3 (word extraction): all words are extracted from the text, and the public words are removed in this step. Every language has a list of such repetitive words, which should be deleted from the extracted words.
Step 4 (TF-IDF calculations): the TF-IDF values are calculated in this step for every text and every language, and finally the TF-IDF of each text in the different languages is used for improving the keyword selection. In this method, each word has a frequency-based weight in the document. Such a weighting system shows how important a word is for a particular document and is used frequently in data retrieval. The weight of a word increases with the number of its repetitions in a certain text, but it is controlled by the number of words in the text, because in a lengthy text some words are naturally repeated even though they carry no particular significance. Term frequency (TF) is a criterion for how common and repetitive a word is within a text and is calculated as follows:

TF(t,d) = 0.5 + 0.5 \times \frac{f(t,d)}{\max\{f(w,d) : w \in d\}} \quad (1)

where f(t,d) is the number of occurrences of the term t in the selected text d, and the denominator is the number of occurrences of the most frequent word w in that text. Inverse document frequency (IDF) is a criterion for how widespread a word is across the texts; it is obtained by dividing the total number of texts by the number of texts that contain the word and taking the logarithm. For example, suppose there are 1000 texts in the whole database. If a certain word (like "is") appears in all of them, the ratio is 1000 divided by 1000, whose logarithm is 0; that is, the word is among the common words and receives a coefficient of 0. However, if the word occurs in only 500 texts, the logarithm is positive and the word receives a larger coefficient. The more texts a word is repeated in, the smaller its IDF weight. To avoid a zero denominator when a word occurs in no text, 1 is added to the denominator, which gives the second formula:

IDF(t,D) = \log\left(\frac{|D|}{1 + |\{d \in D : t \in d\}|}\right) \quad (2)

where |D|, in the numerator, is the number of existing texts and the denominator is one plus the number of texts containing the word. The TF-IDF is then calculated through formula (3):

TF\text{-}IDF(t,d,D) = TF(t,d) \times IDF(t,D) \quad (3)

Step 5 (saving calculations in the database): the calculations performed by the TF-IDF algorithm are saved in the database.
Step 6 (improving the extraction with the proposed TF-IDF): in the conventional TF-IDF, for a text in a certain language, the words with the highest TF-IDF values are considered the keywords of that text in that language. In the proposed method, however, a word is called a keyword if its average TF-IDF over that text in the same language and in the other languages is high. Therefore, instead of using the TF-IDF of the text in one language, its average TF-IDF over the available languages is used. This simple but useful change improves keyword extraction significantly. In this paper, the median and the maximum of the TF-IDF values of a text over the different languages were also tested, and both outperformed the conventional method; the method that calculates the average TF-IDF, however, has the highest accuracy.
Step 7 (depicting results): this step shows the keywords extracted by the improved TF-IDF algorithm.
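To make equations (1)-(3) and step 6 concrete, here is a minimal Python sketch under our own assumptions: documents are given as lists of tokens, and term_by_lang is an assumed mapping from each language to the translation of the candidate word in that language (the paper does not prescribe this data layout or these names):

import math
from collections import Counter

def tf(term, doc_tokens):
    # Equation (1): augmented term frequency.
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus_tokens):
    # Equation (2): log of corpus size over (1 + number of texts containing the term).
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / (1 + containing))

def tf_idf(term, doc_tokens, corpus_tokens):
    # Equation (3).
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

def average_tf_idf(term_by_lang, doc_by_lang, corpus_by_lang):
    # Step 6: average the single-language TF-IDF of the aligned term over all languages.
    scores = [tf_idf(term_by_lang[lang], doc_by_lang[lang], corpus_by_lang[lang])
              for lang in doc_by_lang]
    return sum(scores) / len(scores)

The last function is the only change the proposed method makes to the conventional selection: words are ranked by this cross-language average instead of by their single-language TF-IDF.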
Step 8 (evaluating the accuracy rate): in this step, the keyword extraction accuracy of the algorithm is calculated through the following formula:

Accuracy\ rate = \frac{\text{No. of correctly extracted keywords}}{\text{Total no. of words extracted as keywords}} \times 100 \quad (4)

where the number of correctly extracted keywords is the number of words common to the actual keywords and those extracted by the algorithm, and the denominator is the total number of words extracted by the algorithm as keywords. The pseudo-code of the proposed keyword extraction algorithm is presented in Figure 2. The algorithm is unsupervised and can be run on plain text; that is, unlike supervised keyword extraction algorithms, it needs no training data set. As noted above, providing appropriate training data is time-consuming and difficult, and if the data are not of good quality, the efficiency of supervised keyword extraction algorithms declines.
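As a small illustration of equation (4), continuing the hypothetical Python sketches above (the function and argument names are our own):

def accuracy_rate(extracted_keywords, actual_keywords):
    # Equation (4): share of the extracted keywords that also appear
    # among the actual keywords, expressed as a percentage.
    correct = sum(1 for w in extracted_keywords if w in actual_keywords)
    return 100.0 * correct / len(extracted_keywords)

# With the figures reported in section 5 for the example text, the conventional
# TF-IDF reaches 100 * 39 / 56 (about 69.6%) and the averaged TF-IDF
# 100 * 6 / 7 (about 85.7%).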
1. Begin
2. If data have ASCII format, change them to Unicode format.
3. Read information from feeds in Unicode with the Get RSS data function.
4. Store the information in a database.
5. Generate the Ignore array based on prepositions, conjunctions, adverbs, and verbs.
6. Read all words with the GetWord function and save them in the Key_word array.
7. Remove the words of the Ignore array from the Key_word array.
8. Calculate Eq. (3) for the Key_word array by running the TF-IDF algorithm.
9. Calculate the average TF-IDF for the Key_word array over the different languages.
10. Save any word of the Key_word array as a keyword if its average TF-IDF is high for the same and the other languages.
11. Calculate Eq. (4) for the identified keywords.
12. End
Figure 2. The pseudo-code of the proposed algorithm

5. EXPERIMENTAL RESULTS
The proposed algorithm was programmed in SQL Server 2012 and Visual Studio 2013, and the simulations were performed on a 64-bit Intel Core i5 CPU at 2.50 GHz with 21 GB of RAM. The database used for evaluating the efficiency and performance of the proposed keyword extraction algorithm is an online dataset containing 200 news items collected from the BBC website in various languages; each news item is available in eight languages. The reason for using such a dataset is to provide up-to-date information that can be processed at the same time. The proposed method is assessed by counting the matches between the keywords extracted by the proposed method and the given keywords.

5.1. The results of the proposed algorithm
The algorithm designed in this study is language independent and has a simple structure. In contrast to language-dependent algorithms (like [26]), which use Persian word roots for keyword extraction, this algorithm is readily applicable to large databases in every language. In the proposed use of TF-IDF, words with a high value in a text across all languages (the mean TF-IDF over all languages) were selected as keywords, and the accuracy of the algorithm is improved by considering the text in its various languages. It is noteworthy that non-keywords, including verbs and prepositions, are repeated considerably in a text, so all non-keywords are set aside at the very beginning. The proposed algorithm is applicable to all multilingual websites; here the results are shown only for the BBC news website. The database used comprises 200 news items collected from the BBC website in eight languages (a total of 1600 news items). As can be seen in Table 1, words with a relatively high TF-IDF (here, TF-IDF greater than 20) were considered, while in the conventional TF-IDF algorithm the words with the highest TF-IDF value in each language are counted as keywords independently. As can be seen in Table 2, in Persian the word "America" is mistakenly detected as a keyword (bold in Table 2), while in English three words, "America", "England", and "London", are mistakenly detected as keywords (bold in Table 2); in the other languages, two or three words were likewise mistakenly identified as keywords. Table 3 illustrates the results of the proposed algorithm for the selected text. As can be seen in the table, the mean TF-IDF over the eight languages (the proposed algorithm) is calculated for each word depicted in Table 1, and seven keywords were selected.
The keywords selected by this method are considered for all eight languages: for this text, the keywords obtained with the mean method, shown in Table 3, are Quds, Zionist, America, demonstration, people, Palestine, and Iran, in which America is mistakenly detected as a keyword for all languages. In contrast, as seen in Table 2, in the conventional TF-IDF method the number of wrongly detected keywords differs per language and is more than one word for most languages. If we compare the accuracy of the mean TF-IDF algorithm (the proposed one) with that of the conventional algorithm (shown in Table 2), the conventional algorithm correctly detected 6 of the 7 Persian keywords, 4 of the 7 English keywords, 3 of the 7 Arabic keywords, and similarly for the other languages. In total, across the 8 languages there are 56 = 7 × 8 correct keywords, of which 39 were detected correctly, so the accuracy of the conventional algorithm is 39/56 = 0.69, while in the mean TF-IDF method 6 of the 7 words were detected correctly for all languages, giving an accuracy of 6/7 = 0.85. A similar comparison applies to the median and maximum methods.
Table 1. The TF-IDF values of the words most likely to be among the keywords of the selected text

Table 2. The results of the conventional TF-IDF algorithm; bold words are mistakenly identified as keywords

Table 3. Results of the proposed (improved TF-IDF) algorithm; bold words are mistakenly identified as keywords
Maximum method (selected keywords for any of the 8 languages) | Median method (selected keywords for all 8 languages) | Average method (selected keywords for all 8 languages) | Right keywords
50 March | 50 March | 47 Ghods | March
47 Ghods | 47 Ghods | 40.12 Zionist | Ghods
43 Zionist | 43 Zionist | 38.87 America | Global
43 America | 43 America | 36.87 March | Zionist
40 Palestine | 38 People | 29.25 People | Iran
29 Terrorist | 23.5 Palestine | 25.87 Palestine | People
38 People | 21 Iran | 22 Iran | Palestine

5.2. The comparison of the obtained results with the other related algorithms
To evaluate the efficiency and performance, the accuracy rate of the proposed algorithm is compared with that of the other methods. The algorithm was tested with the 200 texts in eight languages referred to in Table 1, with a total of 1200 correct keywords. The accuracy rate of the conventional TF-IDF algorithm, with 1014 correct words out of 1672 extracted keywords, is 60.6%, while for the proposed algorithm, namely the mean TF-IDF, with 1164 correct words out of 1275 extracted words, the rate is 91.3%. In the proposed algorithm with the median method, 1092 correct words out of 1456 extracted words give a rate of 75%; for the maximum method, 1021 correct words out of 1531 extracted words give an accuracy rate of 66.6%. The accuracy rate of the graph-based algorithm [27] on these data is 80%. Concerning the obtained rates, the mean method, with an accuracy rate of 91.3%, is the best. Table 4 summarizes the results on the BBC data and suggests that the proposed algorithm not only extracts keywords independently of language but also achieves considerably better results. Table 5 compares the algorithm with other related algorithms.
Table 4. Accuracy rates of the proposed algorithm variants and of other keyword extraction algorithms on the BBC data
Algorithm | Proposed TF-IDF (maximum) | Proposed TF-IDF (median) | Proposed TF-IDF (average) | Graph [27] | Conventional TF-IDF
Accuracy rate | 66.6% | 75% | 91.3% | 80% | 60.6%

Table 5. Comparing the proposed algorithm with other related algorithms
Algorithm | Accuracy
The proposed algorithm | 91.3%
Graph [27] | 80%
Kp [28] | 47.7%
MSF [29] | 60%
GATE [30] | 64.4%
Habibi [1] | 75%
Single-Document [31] | 83.2%

6. CONCLUSION
Data retrieval is widely applied in everyday life, and increasing the efficiency and performance of information retrieval systems is very important for their designers. By investigating the applications of data retrieval and text mining, we observed that the keywords of a text are important and facilitate these processes. For example, by finding the keywords in a news item, or the sentences that contain more keywords, we can summarize or comprehend the text more easily. To achieve this aim, an unsupervised keyword extraction algorithm was proposed based on improving the TF-IDF algorithm for multilingual texts. In the proposed algorithm, the average TF-IDF of the candidate words was calculated over the same and the other languages, and the words with the highest average TF-IDF were chosen as the extracted keywords. A database of 200 news items collected from the BBC website in various languages was used to evaluate the efficiency of the proposed algorithm. The experimental results show that the selected keywords closely match the keywords given by the website, which confirms the reliability of the algorithm. The overall accuracy rate of the algorithm is 91.3%, which is higher than that of the state-of-the-art keyword extraction algorithms considered. As future work, we intend to pursue three strategies to improve the proposed algorithm in terms of application, complexity, and time: adding the extraction of complex (multi-word) keywords, creating real-time and online behaviour by focusing on parallel processing, and normalizing the feed addresses to facilitate access.

REFERENCES
[1] M. Habibi and A. Popescu-Belis, "Keyword extraction and clustering for document recommendation in conversations," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 4, pp. 746-759, 2015.
[2] M. Savić, et al., "A language-independent approach to the extraction of dependencies between source code entities," Information and Software Technology, vol. 56, no. 10, pp. 1268-1288, 2014.
[3] S. Siddiqi and A. Sharan, "Keyword and keyphrase extraction techniques: a literature review," International Journal of Computer Applications, vol. 109, no. 2, pp. 18-23, 2015.
[4] T. S. Chung, et al., "A survey of flash translation layer," Journal of Systems Architecture, vol. 55, no. 5-6, pp. 332-343, 2009.
[5] N. I. Abdulkhaleq, et al., "Improving the data recovery for short length LT codes," International Journal of Electrical & Computer Engineering, vol. 10, no. 2, pp. 1972-1979, 2020.
[6] N. N. Kulkarni and S. A. Jain, "Checking integrity of data and recovery in the cloud environment," Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), vol. 13, no. 2, pp. 626-633, 2019.
[7] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48-57, 2014.
[8] V. Jain and S. V. A. V. Prasad, "Ontology based information retrieval model in semantic web: a review," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 8, pp. 837-842, 2014.
[9] K. Kim, et al., "Language independent semantic kernels for short-text classification," Expert Systems with Applications, vol. 41, no. 2, pp. 735-743, 2014.
[10] D. Deshwal, et al., "Feature extraction methods in language identification: A survey," Wireless Personal Communications, vol. 107, pp. 2071-2103, 2019.
[11] S. K. Bharti and K. S. Babu, "Automatic keyword extraction for text summarization: A survey," arXiv preprint arXiv:1704.03242, 2017.
[12] E. Ferrara, et al., "Web data extraction, applications and techniques: A survey," Knowledge-Based Systems, vol. 70, pp. 301-323, 2014.
[13] P. Turney, "Learning to extract keyphrases from text," National Research Council Canada, 2002.
[14] S. S. Hong, et al., "The feature selection method based on genetic algorithm for efficient of text clustering and text classification," International Journal of Advances in Soft Computing and its Applications, vol. 7, no. 1, pp. 22-40, 2015.
[15] E. Frank, et al., "Domain-specific keyphrase extraction," in 16th International Joint Conference on Artificial Intelligence (IJCAI 99), vol. 2, pp. 668-673, 1999.
[16] C. Caragea, et al., "Citation-enhanced keyphrase extraction from research papers: A supervised approach," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1435-1446, 2014.
[17] G. Ercan and I. Cicekli, "Using lexical chains for keyword extraction," Information Processing & Management, vol. 43, no. 6, pp. 1705-1714, 2007.
[18] F. Fkih and M. N. Omri, "Complex terminology extraction model from unstructured web text based linguistic and statistical knowledge," International Journal of Information Retrieval Research, vol. 2, no. 3, pp. 1-18, 2013.
[19] G. Figueroa, et al., "RankUp: Enhancing graph-based keyphrase extraction methods with error-feedback propagation," Computer Speech & Language, vol. 47, pp. 112-131, 2018.
[20] S. Lahiri, et al., "Keyword and keyphrase extraction using centrality measures on collocation networks," arXiv preprint arXiv:1401.6571, 2014.
[21] P. Tonella, et al., "Using keyword extraction for web site clustering," in Fifth IEEE International Workshop on Web Site Evolution (Theme: Architecture), pp. 41-48, 2003.
[22] S. K. Biswas, et al., "A graph based keyword extraction model using collective node weight," Expert Systems with Applications, vol. 97, pp. 51-59, 2018.
[23] R. Mihalcea and P. Tarau, "TextRank: Bringing order into text," in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404-411, 2004.
[24] S. Duari and V. Bhatnagar, "sCAKE: Semantic connectivity aware keyword extraction," Information Sciences, vol. 477, pp. 100-117, 2019.
[25] K. S. Hasan and V. Ng, "Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 365-373, 2010.
[26] R. Farhad, et al., "Improved clustering Persian text based on keyword using linguistic and thesaurus knowledge," Signal and Data Processing, vol. 13, no. 1, pp. 87-100, 2016.
[27] A. R. Nabhan and K. Shaalan, "Keyword identification using text graphlet patterns," in International Conference on Applications of Natural Language to Information Systems, pp. 152-161, 2016.
[28] C. B. Ali, et al., "A two-level keyphrase extraction approach," in International Conference on Intelligent Text Processing and Computational Linguistics, pp. 390-401, 2015.
[29] D. Y. Lee, et al., "A new extraction algorithm for hierarchical keyword using text social network," in Information Science and Applications (ICISA) 2016, pp. 903-912, 2016.
[30] P. Nesi, et al., "A distributed framework for NLP-based keyword and keyphrase extraction from web pages and documents," in 21st International Conference on Distributed Multimedia Systems (DMS 2015), pp. 1-7, 2015.
[31] F. Rousseau and M. Vazirgiannis, "Main core retention on graph-of-words for single-document keyword extraction," in European Conference on Information Retrieval, pp. 382-393, 2015.
BIOGRAPHIES OF AUTHORS
Bahareh Hashemzadeh has been lecturing and researching in Electrical and Computer Engineering at the University of Torbat Heydarieh since 2016. She received her M.Sc. in Information Science from Birjand University in 2015. Her fields of interest are information technology, obfuscation, and data mining.
Majid Abdolrazzagh-Nezhad has been lecturing and researching as an assistant professor at the Computer Engineering Department of Bozorgmehr University of Qaenat since 2013 and has been dean of the department since 2016. He was dean of the Faculty of Computer Science from 2013 to 2016. He has supervised Master's and PhD students at the Islamic Azad University of Birjand since 2016. He received his PhD in Computer Science from the Faculty of Information Science and Technology, National University of Malaysia (UKM), in 2013, his master's degree in Operations Research from the University of Sistan and Baluchestan in 2007, and his bachelor's degree from the University of Birjand in 2004. His fields of interest are artificial intelligence, optimization, data mining, scheduling, and uncertain systems. He is a Young Professionals member of the Institute of Electrical and Electronics Engineers (IEEE) and a reviewer for journals such as Information Sciences, Applied Soft Computing, Soft Computing, International Journal of Production Research, IEEE Transactions on Industrial Electronics, and IEEE Transactions on Industrial Informatics.