Jan Zizka et al. (Eds) : ACSTY, NATP - 2016
pp. 49– 60, 2016. © CS & IT-CSCP 2016 DOI : 10.5121/csit.2016.61404
DICTIONARY BASED AMHARIC-ARABIC
CROSS LANGUAGE INFORMATION
RETRIEVAL
H L Shashirekha¹ and Ibrahim Gashaw²
Department of Computer Science, Mangalore University,
Mangalagangotri, Mangalore-574199
¹hlsrekha@gmail.com
²ibrahimug1@gmail.com
ABSTRACT
The demand for multilingual information is becoming more apparent as the number of internet users
throughout the world grows, and it creates the problem of retrieving documents in one
language by specifying a query in another language. This increasing demand can be addressed by
designing automatic tools that accept a query in one language and retrieve the relevant
documents in other languages. We have developed a prototype Amharic-Arabic Cross Language
Information Retrieval system using a dictionary-based approach, which enables users to
enter a query in Amharic and retrieve the relevant documents in both Amharic and Arabic
from an Amharic-Arabic corpus.
KEYWORDS
Information Retrieval, Dictionary, Machine Translation, Relevance Feedback.
1. INTRODUCTION
With the rapid growth of the Internet, the World Wide Web (WWW) has become one of the most
popular media for spreading multilingual information. The need for multilingual information is
becoming more apparent as the number of internet users throughout the world keeps increasing. This
ability to disseminate multilingual information has increased the need to mediate automatically
across multiple languages and, in the case of the WWW, to access “foreign language” Web pages
[1]. The increasing need to retrieve multilingual documents has opened up a new branch of
Information Retrieval (IR) called Cross Lingual Information Retrieval (CLIR) [2]. Its goal is to
accept information, transform it into a searchable format and provide an interface that allows a user
to search and retrieve information in different languages [3]. CLIR has many applications, such as
ad hoc retrieval, text summarization, question answering, and text classification, which help make
digital repositories accessible to a much wider audience [4].
In addition to the challenges of conventional IR, CLIR systems face many challenges related
to language issues [5], such as:
50 Computer Science & Information Technology (CS & IT)
a. Translation disambiguation: homonymy and polysemy [6] make it difficult to find
the most appropriate translation for a given word.
b. A lack of appropriate resources for evaluating CLIR with low-density languages.
c. Inflected words in the query cannot easily be located as translated root words in the
dictionary, because of stemming issues.
d. New words added to the language may not be recognized by the existing
system, resulting in out-of-vocabulary (OOV) words.
e. Most OOV words in the query, such as technical terms and named entities, reduce the
performance of the system.
According to Cardeñosa et al. [5], CLIR approaches can be grouped into three categories: document
translation, query translation, and interlingua translation.
• In document translation, every document is translated into the query language
and retrieval is then performed using classical IR techniques. Translation can be applied
offline, well in advance, to all documents, and it lets users
access the content in their own language. However, machine or (large-scale) human
translation may not always be a realistic option for every language pair, as it is time
consuming: every document needs to be translated into the other languages irrespective of
how often it is used.
• In the query translation approach, the query terms are translated from the source language to
the target language. Online translation can be applied to the query
entered by a user, and the user can reformulate, elaborate or narrow down the
translated query. Translating a query by dictionary look-up is far more efficient than
translating the entire document collection. However, it is less reliable, since short queries do not
provide enough context to disambiguate between candidate translations of the query
words, and it does not exploit domain-specific semantic constraints or corpus statistics to
resolve translation ambiguity.
• In the interlingua translation approach, the source-language text to be translated is
transformed into an interlingua, i.e., an abstract language-independent representation.
The target language is then generated from the interlingua. This approach is useful when
there are no resources for direct translation, but it performs worse than direct
translation.
Translation techniques in CLIR are categorized into direct and indirect translation [7]. Direct
translation uses a Machine Readable Dictionary (MRD), parallel corpora, a machine translation
algorithm, or a combination of these.
• In dictionary-based translation, the query words are translated into the target language
using an MRD [8]. MRDs are electronic versions of printed dictionaries, and may be general
dictionaries, domain-specific dictionaries, or a combination of both. This technique has been adopted
in CLIR because bilingual dictionaries are widely available.
• Parallel corpora contain a set of documents and their translations in one or more other
languages. These paired documents can be used to find the most likely translations of
terms between the languages in the corpus.
• Machine translation (MT) can be implemented by using an MT system to
translate the documents in one language of the corpus into the language of the user’s query,
which can be done offline in advance or online [9].
Indirect translation is a common solution when resources supporting direct
translation are absent. It can be applied through a transitive or a dual translation system. In
transitive translation, an intermediary (pivot) language is placed between the source query
and the target document collection to enable comparison with the target document
collection. In dual translation systems, both the query and the document
representations are translated into the intermediate language [10].
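Transitive translation can be illustrated with a minimal Python sketch that chains two bilingual dictionaries through a pivot language. The function name `pivot_translate` and the placeholder dictionary entries are hypothetical and are not taken from the cited works; real systems would also need to handle the ambiguity that accumulates across the two look-ups.

```python
def pivot_translate(term, src_to_pivot, pivot_to_tgt):
    """Translate a source term into target terms via a pivot language.

    Both dictionaries map a term to a list of candidate translations.
    Candidates are returned in first-seen order, without duplicates.
    """
    targets = []
    for pivot_term in src_to_pivot.get(term, []):
        for target_term in pivot_to_tgt.get(pivot_term, []):
            if target_term not in targets:
                targets.append(target_term)
    return targets


# Placeholder entries (hypothetical): "a" is a source word, "p1"/"p2" are
# pivot-language words, and "x"/"y"/"z" are target-language words.
SRC_TO_PIVOT = {"a": ["p1", "p2"]}
PIVOT_TO_TGT = {"p1": ["x", "y"], "p2": ["y", "z"]}

candidates = pivot_translate("a", SRC_TO_PIVOT, PIVOT_TO_TGT)
```

The candidate list grows with every ambiguous pivot word, which is one reason indirect translation tends to be noisier than direct translation.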
In all the above-mentioned cases, a key element is the mechanism used to map between languages.
This translation knowledge can be encoded in different forms: as a data structure of query- and
document-language term correspondences in an MRD, or as an algorithm, such as a machine
translation or machine transliteration system [11]. While all of these forms are effective, the latter
requires a substantial investment of time and resources to develop and is not widely or
readily available for many language pairs.
CLIR is a promising field of research that bridges the gap between different
languages, and hence between people who speak different languages and belong to different
cultures. As CLIR is still in its infancy, work on many language pairs is being attempted.
Amharic-Arabic is one such language pair that needs to be explored for CLIR.
According to the 2007 census, Amharic speakers make up 26.9% of Ethiopia’s population.
Amharic is also spoken by many people in Israel, Egypt and Sweden [1]. Arabic is
spoken as a first language by 250 million people in 21 countries, and as a second language in
Islamic countries [8]. Ethiopia is one of the countries in which more than 33.3% of the
population follow Islam, and they use the Arabic language to teach religion and for
communication. Arabic and Amharic belong to the Semitic family of
languages [12], in which words are formed by modifying the root itself
internally and not simply by concatenating affixes to word roots. Both Amharic and Arabic are
morphologically very rich languages.
The current Amharic writing system consists of a core of thirty-three characters (ፊደል, fidel), each
of which occurs in a basic form and in six other forms known as orders [1]. The non-basic forms
are derived from the basic forms by more-or-less regular modifications. Thus, there are 231
different characters (33 × 7). The seven orders represent syllable combinations consisting of a consonant
and a following vowel. This characteristic, according to Abebayehu [13], makes the Amharic
writing system a syllabic writing system: each character or symbol represents
a combination of a consonant and a vowel. The characters are written in a unique script that is
now supported in Unicode (U+1200 - U+137F) [14].
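As a small illustration of how the Unicode range above can be used in practice, the following Python sketch checks whether text is written in the Ethiopic block, which a CLIR front end could use to decide whether a query is Amharic. The helper names are hypothetical, and the range test is only a heuristic: the block also covers other languages written in the same script.

```python
ETHIOPIC_START = 0x1200  # first code point of the Ethiopic block
ETHIOPIC_END = 0x137F    # last code point of the Ethiopic block


def is_ethiopic(ch):
    """Return True if a single character lies in the Ethiopic block."""
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def ethiopic_ratio(text):
    """Return the fraction of non-whitespace characters that are Ethiopic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_ethiopic(c) for c in chars) / len(chars)


# 33 core characters, each in 7 orders, as described in the text.
CORE_CHARACTERS = 33
ORDERS = 7
SYLLABLE_COUNT = CORE_CHARACTERS * ORDERS
```

A query could, for example, be routed to the Amharic pre-processing pipeline when `ethiopic_ratio(query)` exceeds a chosen threshold; the threshold itself would be a design decision outside the scope of the paper.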
The Arabic alphabet consists of 28 characters, or 29 characters if the Hamza is considered a
separate character. Like Persian and Hebrew, and unlike many international
languages, it is written from right to left. Three of the Arabic characters appear in different shapes, as follows [15][16]:
• Hamza (ء) is sometimes written ا, إ or أ (alif).
• Ta marbouta (ة), pronounced like t in English and found at the end of a word, is sometimes written without its two dots (ه = ha).
• Alif maqsurah (ى) is the character (ي = ya) without dots.
These three characters pose some difficulties in setting up a CLIR system. Some
Arabic language resources ignore the Hamza and the dots above “ta marbouta” in order to unify the
input and output for these characters. Arabic also has a whole series of non-alphabetic signs,
added above or below the consonant letters, that make the reading of a word less ambiguous.
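The character variations described above are commonly handled by a normalization step applied to both documents and dictionary entries. The following Python sketch shows one such normalization. It is an assumption about what a reasonable normalizer looks like, not the procedure used in this work: it maps the hamza-carrying alif forms to bare alif, ta marbouta to ha, and alif maqsurah to ya, and strips the diacritic signs in the range U+064B to U+0652 together with the tatweel (elongation) character.

```python
import re

# Arabic diacritic signs (tanwin, fatha, damma, kasra, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character used to stretch words

# Letter variants mapped to a single canonical form.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0629": "\u0647",  # ta marbouta -> ha
    "\u0649": "\u064A",  # alif maqsurah -> ya
})


def normalize_arabic(text):
    """Strip diacritics and tatweel, then unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(CHAR_MAP)
```

Applying the same function to queries, documents and dictionary keys keeps the three in a consistent form, at the cost of merging a few words that differ only in these characters.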
Both Arabic and Amharic pose translation challenges for many reasons [17][18]. For example,
Arabic sentences are usually long, and punctuation has little or no effect on the interpretation
of the text. Contextual analysis is important in both Arabic and Amharic in order to understand the
exact meaning of some words. For example, in Amharic, the word “ገና” can mean either the
Christmas holiday or waiting for something until it happens. Characters are sometimes stretched for
justified text, which hinders exact matching of the same word. In Arabic, synonyms are very
common. For example, “year” has three synonyms in Arabic (عام، حول، سنة), and all are widely used
in everyday communication. Another challenge in Arabic is the absence of diacritization
(sometimes called vocalization). Diacritics are symbols placed over or under
letters to indicate the proper pronunciation and to disambiguate
words. The absence of diacritization in Arabic texts poses a real challenge for Arabic natural
language processing as well as for translation, leading to high ambiguity. Though
diacritics are extremely important for readability and understanding, they do not appear in most
printed media in Arabic regions or on Arabic Internet web sites. They are visible in religious
texts such as the Quran, which is fully diacritized in order to prevent misinterpretation.
Ethiopia has good socio-economic relationships with Arabic countries, and they communicate
using the Arabic and Amharic languages. For example, reports sent between Ethiopia and Arabic
countries need to be written in both languages, and most new and translated religious books
are written in both languages by Muslim scholars. As with English, a large number of
unstructured documents are available on the net in Arabic and Amharic. However, IR
tools and techniques are mostly oriented towards English, and there are currently several
attempts to develop IR tools for Arabic and Amharic. Many Internet users who are
non-native Arabic speakers can read and understand Arabic documents, but they feel
uncomfortable formulating queries in Arabic. This may be because of their limited
Arabic vocabulary, or because of possible misuse of Arabic words. Different attempts
have been made to develop CLIR systems for the Amharic-French [19] and Afan Oromo-English [3]
language pairs. Nevertheless, no CLIR system has been found for the Amharic-Arabic language pair.
Developing a standard corpus and tools is essential in order to test the performance of a
newly developed CLIR system [20].
The aim of this research work is to develop a prototype of a dictionary-based Amharic-Arabic
CLIR system that enables Amharic and Arabic language users to retrieve documents in both
languages, and to examine the ability of the proposed system. We employ the query translation
strategy, which is more efficient than the document translation strategy, because document
translation incurs the overhead cost of translating all documents, especially when new
documents are added frequently and not all of the documents are of interest to the users [21].
The remainder of this paper is organized as follows: a review of related works is presented in
Section 2 and the proposed CLIR method in Section 3. Section 4 gives the experimental setup and
the results, and the paper concludes in Section 5.
2. RELATED WORKS
Several researchers have studied CLIR for different language pairs. However, little
work has been reported on Amharic and Arabic paired with other languages. Some of the
prominent works are discussed below.
Argaw Atelach Alemu et al. [19] present a dictionary-based approach to translating Amharic
queries into French bags-of-words in the Amharic-French bilingual track at CLEF 2005, using two
search engines: SICS and Lucene. Non-content-bearing words were removed both before and
after the dictionary lookup. TF/IDF values supplemented by a heuristic function were used to
remove the stop words from the Amharic queries, and two French stop word lists were used to
remove stop words from the French translations. From the experiments, they found that the SICS
search engine performed better than Lucene. Aljlayl et al. [1] empirically evaluated the use of an
MT-based approach for query translation in an Arabic-English CLIR system using TREC-7 and
TREC-9 topics and collections. The effect of query length on the performance of MT was also
investigated to explore how much context is actually required for successful MT processing. A
well-formed source query enables the MT system to provide its best accuracy. Tesfaye Fasika
[20] employed a corpus-based approach that makes use of phrasal query translation for
Amharic-English CLIR. The experiments yielded a recall value of 24.8% for translated
Amharic queries, 46.3% for Amharic queries and 43.6% for the baseline English queries.
Nigussie Eyob [7] developed a corpus-based Afaan Oromo–Amharic CLIR system to
enable Afaan Oromo speakers to retrieve Amharic information using Afaan Oromo queries.
Documents including news articles, the Bible, legal documents and proclamations from the customs
authority were used as the parallel corpus. Two experiments were conducted: one allowing only one
possible translation for each Afaan Oromo query term, and one allowing all possible translations.
The first experiment returned a maximum average precision of 81% and 45% for the monolingual
(Afaan Oromo) and bilingual (translated Amharic) query runs, respectively. The second
experiment showed better recall and precision than the first, namely 60%
for the bilingual query run, while the result for the monolingual query run remained the same.
Mequannint et al. [22] designed a model for an Amharic-English search engine and developed a
bilingual Web search engine based on the model that enables Web users to find the
information they need in Amharic and English. They identified different language-dependent
query pre-processing components for query translation and developed a bidirectional
dictionary-based translation system, which incorporates a transliteration component to handle
proper names, which are often missing from bilingual lexicons. They used an Amharic search engine
and an open source English search engine (Nutch) for Web document crawling, indexing,
searching, ranking and retrieving. The experimental results showed that the Amharic-English
cross-lingual retrieval engine achieved 74.12% of the performance of its corresponding English monolingual
retrieval engine, and the English-Amharic cross-lingual retrieval engine achieved 78.82% of
that of its corresponding Amharic monolingual retrieval engine.
In CLIR, the semantic level of words is crucial. Solving the problem of word sense
disambiguation will enhance the effectiveness of CLIR systems. Andres Duque et al. [23] studied
how to choose the best dictionary for Cross Lingual Word Sense Disambiguation (CLWSD). They
compared different dictionaries in two frameworks: analysing the
potential results of an ideal system using those dictionaries, and analysing the results obtained by a particular
unsupervised CLWSD system, Co-occurrence Graph, when
different bilingual dictionaries provide the potential translations. They also developed a hybrid
system that combines the results provided by a probabilistic dictionary with those obtained with a
Most Frequent Sense (MFS) approach. They focused only on English-Spanish cross-lingual
disambiguation. The hybrid approach outperforms the results obtained by other
unsupervised systems.
As Arabic is a relatively widely researched Semitic language and shares a number of
properties with Amharic, some computational linguistic research [1],[19],[24]
conducted on Amharic and Arabic recommends customizing and using the
tools developed for these languages. While the above researchers have attempted to develop and
evaluate CLIR for Amharic and Arabic each paired separately with other languages, no research has
paired these two languages together.
3. METHODOLOGY
In this work, an attempt has been made to design a dictionary-based Amharic-Arabic CLIR
system, which comprises indexing and searching tasks. An inverted file indexing structure is used to
organize documents to speed up searching. A probabilistic model, which attempts to capture the
uncertain nature of an IR system, guides the searching process. Amharic and Arabic documents
are pre-processed separately by performing tokenization, normalization, stop word removal,
punctuation removal and stemming. Figure 3.1 shows the general architecture of the system,
which is adopted from C. Peters et al. [25]. A bilingual dictionary, which contains a list of
Amharic words and their Arabic translations, was constructed manually and is used to translate Amharic
queries into Arabic queries.
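The inverted file structure mentioned above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the system's implementation: tokenization is plain whitespace splitting, the other pre-processing steps (normalization, stop word removal, stemming) are omitted, and the names `build_inverted_index` and `lookup` as well as the English placeholder documents are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Split text on whitespace (a stand-in for full pre-processing)."""
    return text.split()


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def lookup(index, query):
    """Return the ids of documents containing at least one query term."""
    result = set()
    for term in tokenize(query):
        result |= index.get(term, set())
    return result


# Placeholder documents (hypothetical, in English for readability).
DOCS = {"d1": "day of judgment", "d2": "day and night"}
INDEX = build_inverted_index(DOCS)
```

Because each term points directly at the documents that contain it, a query only touches the posting sets of its own terms rather than scanning the whole collection.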
The binary independence probabilistic information retrieval model is adopted to search for relevant
documents in the Amharic-Arabic parallel corpus. Probabilistic information retrieval
estimates the probability that a document di will be judged relevant by the user
with respect to query q, expressed as P(R|q, di), where R is the set of relevant
documents. Typically, in the probabilistic model, the documents are divided, based on the query, into
relevant and irrelevant documents [26]. However, the probability that any given document is relevant or
irrelevant with respect to the user’s query is initially unknown. Therefore, the probabilistic model
needs to guess the relevance at the beginning of the search process. The user then inspects the first
retrieved documents and gives feedback to the system by marking relevant documents as
relevant and irrelevant documents as irrelevant. By collecting relevance feedback data from a few
documents, the model can then estimate the probability of relevance for the
remaining documents in the collection. This process is applied iteratively to improve the
performance of the system, so that it retrieves more and more relevant documents that satisfy the
user’s need.
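The feedback loop described above needs a rule for re-estimating term probabilities once the user has judged some documents. The paper does not state its update rule, so the Python sketch below uses the textbook smoothed estimates for the binary independence model as an assumption. Here N is the number of documents in the collection, n_i the number of documents containing term k_i, V the number of documents judged relevant so far, and V_i the number of those that contain k_i; the function name is hypothetical.

```python
def feedback_estimates(n_i, N, V_i, V):
    """Re-estimate term probabilities from relevance feedback.

    Returns (P(k_i | relevant), P(k_i | non-relevant)) using the
    textbook estimates with 0.5 smoothing (an assumed update rule).
    """
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel


# Example: a term occurs in 10 of 100 documents; the user has marked
# 5 documents as relevant, 4 of which contain the term.
p_rel, p_nonrel = feedback_estimates(n_i=10, N=100, V_i=4, V=5)
```

With no feedback yet (V = 0 and V_i = 0), the first estimate reduces to 0.5, which matches the usual starting guess for the probability of a term given relevance.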
Figure 3.1 Dictionary based Amharic-Arabic CLIR system architecture
The assumptions made to handle the initial uncertainty in the probabilistic model are:
• P(ki|R) is constant for all index terms ki (usually equal to 0.5).
• The distribution of index terms among the non-relevant documents can be approximated
by the distribution of index terms among all the documents in the collection.
These two assumptions give:
$$P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}$$
where N is the total number of documents in the collection and ni is the number of documents
that contain the index term ki.
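To make the initial ranking concrete, the Python sketch below scores a document under the two assumptions above, taking P(ki|R) = 0.5 and approximating the probability of a term in non-relevant documents by ni/N. These are the standard binary independence estimates and are treated here as an assumption about the formula the paper intends. The function names and the placeholder terms are hypothetical.

```python
import math


def bim_term_weight(n_i, N, p_rel=0.5):
    """Weight of one term: log-odds under relevance plus inverse log-odds
    under non-relevance, with P(k_i | non-relevant) estimated as n_i / N."""
    p_nonrel = n_i / N
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def bim_score(query_terms, doc_terms, doc_freq, N):
    """Sum the weights of the terms shared by the query and the document."""
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        n_i = doc_freq.get(term, 0)
        if 0 < n_i < N:  # terms in no or all documents carry no weight here
            score += bim_term_weight(n_i, N)
    return score


# Placeholder document frequencies in a collection of 100 documents.
DOC_FREQ = {"a": 10, "b": 5, "c": 50}
score = bim_score(["a", "b"], ["a", "c"], DOC_FREQ, N=100)
```

With p_rel fixed at 0.5 the first logarithm vanishes, so the initial weight of a term reduces to log((N − ni)/ni), which favours rare terms in the same way an inverse document frequency weight does.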
4. EXPERIMENTATION AND EVALUATION
The Holy Quran, available through the Tanzil Quran navigator website [27], comprises 114 chapters,
each containing a minimum of 3 and a maximum of 286 verses, in Arabic and Amharic. In
this work, subject to the number of verses available, we downloaded up to 10 verses
from each chapter in Arabic, together with the corresponding verses in Amharic.
Even though a complete evaluation requires assessing both system effectiveness
and efficiency, only the effectiveness of the IR system is considered here to determine the
performance of the system for the translated queries. Precision and recall are used to measure the
effectiveness of the IR system.
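For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal Python sketch of both measures follows; the function name and the document ids in the example are hypothetical and are not drawn from the experiments reported here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Example: 4 documents retrieved, 3 documents relevant, 2 in common.
precision, recall = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
```

In this example precision is 2/4 and recall is 2/3; averaging the per-query values over a query set gives figures comparable to those reported for a whole system.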
We used Amharic queries to retrieve documents in both Arabic and Amharic.
In addition to retrieving Amharic documents, the Amharic query is translated into Arabic to
retrieve Arabic documents. We used 14 simple queries to test the performance of the system
and the results obtained are shown in Table 4.1. The performance of the system on Arabic
relevant retrieved documents is much better than that of Amharic documents (i.e., 83.89%
precision for Amharic against 52.02% precision for Arabic).
When the system is tested with queries that contain out-of-vocabulary words, i.e. words missing from the dictionary,
its precision decreases and its recall increases, especially for Arabic documents. For example, if
we add the word “ለኾነው” (to become), which is not translated correctly or does not appear in the
dictionary, to the first query, giving “የፍርዱ ቀን ባለቤት ለኾነው” (the Owner of the Day of Judgment), the word
“ለኾነው” (to become) is used directly for searching. As a result, the number of non-relevant
Amharic documents retrieved increases, which greatly decreases the performance of the system. The main
hindrance to system performance is incorrect translation due to unnormalized Arabic words,
specifically words with diacritics, which the system cannot map to the dictionary words.
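The behaviour described above, where an out-of-vocabulary word is passed through to the search unchanged, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the system's code: the function name is hypothetical, and the single dictionary entry (Amharic “ቀን”, day, mapped to Arabic “يوم”) is a toy example that is not taken from the authors' dictionary.

```python
def translate_query(query, dictionary):
    """Translate a query term by term using a bilingual dictionary.

    Returns (translated_terms, oov_terms). Terms missing from the
    dictionary are kept unchanged and also reported as OOV.
    """
    translated = []
    oov = []
    for term in query.split():
        if term in dictionary:
            translated.extend(dictionary[term])  # keep every candidate
        else:
            translated.append(term)  # OOV term is searched as-is
            oov.append(term)
    return translated, oov


# Toy dictionary with one illustrative entry (an assumption).
TOY_DICTIONARY = {"ቀን": ["يوم"]}

terms, oov_terms = translate_query("ቀን ለኾነው", TOY_DICTIONARY)
```

Reporting the OOV terms separately leaves room for a later decision, such as dropping them from the Arabic query or sending them to a transliteration step, instead of letting them match unrelated documents.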
Table 4.1 Performance of the proposed system
5. CONCLUSION
Multilingual information is required in countries that have multiple languages, and it is vital
as the number of internet users throughout the world keeps increasing. We have developed a
prototype of a dictionary-based Amharic-Arabic CLIR system that enables Amharic and Arabic
language users to retrieve documents in both languages, and we have examined the ability of the proposed
system. The effectiveness of the proposed system was evaluated, and the performance of the
system on Arabic relevant retrieved documents was much better than that on Amharic documents.
The main challenges in dictionary-based CLIR are untranslatable words due to the limitations of the
general Amharic-Arabic dictionary, the processing of inflected words, phrase identification and
translation, and lexical ambiguity in Amharic and Arabic.
Even though this research is significant for retrieving the required information from Amharic-Arabic
documents, some issues need to be investigated further to develop an efficient and effective
CLIR system. This approach requires an exhaustive and detailed mapping of concepts in
both languages, which is very difficult to build.
REFERENCES
[1] M. Aljlayl, O. Frieder, and D. Grossman, “On Arabic-English cross-language information retrieval: A
machine translation approach,” in Information Technology: Coding and Computing, 2002.
Proceedings. International Conference on, 2002, pp. 2–7.
[2] K. Sourabh, “An Extensive Literature Review on CLIR and MT activities in India,” Int. J. Sci. Eng.
Res., 2013.
[3] D. Bekele, “Afaan Oromo Oromo-English Cross-Lingual Information Retrieval (Clir),” AAU, 2011.
[4] D. Kelly, “Methods for evaluating interactive information retrieval systems with users,” Found.
Trends Inf. Retr., vol. 3, no. 1—2, pp. 1–224, 2009.
[5] J. Cardeñosa, C. Gallardo, and A. Toni, “Multilingual Cross Language Information Retrieval A new
approach.”
[6] M. Abusalah, J. Tait, and M. Oakes, “Literature Review of Cross Language Information Retrieval,”
Comput. Hum., pp. 175–177, 2005.
[7] E. Nigussie, “Afaan Oromo--Amharic Cross Lingual Information Retrieval,” AAU, 2013.
[8] T. Hedlund, “Dictionary-based cross-language information retrieval: principles, system design and
evaluation,” in SIGIR Forum, 2004, vol. 38, no. 1, p. 76.
[9] M. R. Warrier and M. S. S. Govilkar, “A SURVEY ON VARIOUS CLIR TECHNIQUES.”
[10] D. Zhou, M. Truran, T. Brailsford, V. Wade, and H. Ashman, “Translation techniques in cross-
language information retrieval,” ACM Comput. Surv., vol. 45, no. 1, p. 1, 2012.
[11] G.-A. Levow, D. W. Oard, and P. Resnik, “Dictionary-based techniques for cross-language
information retrieval,” Inf. Process. Manag., vol. 41, no. 3, pp. 523–547, 2005.
[12] A. D. Rubin, “The Subgrouping of the Semitic Languages,” Linguist. Lang. Compass, vol. 2, no. 1,
pp. 79–102, 2008.
[13] S. ABEBAYEHU, “Amharic-English Script Identification in Real-Life Document Images,” aau,
2012.
[14] B. Ayalew, “The submorphemic structure of Amharic: toward a phonosemantic analysis,” University
of Illinois at Urbana-Champaign, 2013.
[15] R. Tsarfaty, “Syntax and Parsing of Semitic Languages,” in Natural Language Processing of Semitic
Languages, Springer, 2014, pp. 67–128.
[16] H. Ishkewy, H. Harb, and H. Farahat, “Azhary: An arabic lexical ontology,” arXiv Prepr.
arXiv1411.1999, 2014.
[17] T. Hailemeskel, “Amharic Text Retrieval: An Experiment Using Latent Semantic Indexing (LSI) with
Singular Value Decomposition (SVD),” M. Sc. Thesis, Addis Ababa University, Addis Ababa, 2003.
[18] F. Ahmed and A. Nurnberger, “Arabic/English word translation disambiguation approach based on
naïve Bayesian classifier,” in Computer Science and Information Technology, 2008. IMCSIT
2008. International Multiconference on, 2008, pp. 331–338.
[19] A. A. Argaw, L. Asker, J. Karlgren, M. Sahlgren, and R. Cöster, “Dictionary-based Amharic-French
information retrieval,” CEUR Workshop Proc., vol. 1171, 2005.
[20] F. Tesfaye, “Phrasal Translation for Amharic English Cross Language Information Retrieval (Clir),”
AAU, 2010.
[21] M. Adriani, “Using statistical term similarity for sense disambiguation in cross-language information
retrieval,” Inf. Retr. Boston., vol. 2, no. 1, pp. 71–82, 2000.
[22] M. Munye and S. Atnafu, “Amharic-English bilingual web search engine,” in Proceedings of the
International Conference on Management of Emergent Digital EcoSystems, 2012, pp. 32–39.
[23] A. Duque, J. Martinez-Romo, and L. Araujo, “Choosing the best dictionary for Cross-Lingual Word
Sense Disambiguation,” Knowledge-Based Syst., vol. 81, pp. 65–75, 2015.
[24] S. A. L. S. F. Adafre, “Machine Translation for Amharic: Where we are,” Strateg. Dev. Mach. Transl.
Minor. Lang., p. 47.
[25] C. Peters, M. Braschler, and P. Clough, Multilingual information retrieval: From research to practice.
Springer Science & Business Media, 2012.
[26] F. Dahak, M. Boughanem, and A. Balla, “A probabilistic model to exploit user expectations in XML
information retrieval,” Inf. Process. Manag., 2016.
[27] “http://tanzil.net/#trans/am.sadiq.”
AUTHORS
Ibrahim Gashaw Kassa has been a Ph.D. candidate at Mangalore University,
Karnataka State, India since 2016. He graduated in 2006 in Information System
from Addis Ababa University, Ethiopia. In 2014, he obtained his master’s degree
in Information Technology from the University of Gondar, Ethiopia. He served as a
lecturer at the University of Gondar from 2009 to May 2016. His research interest
is in Cross Language Information Retrieval.
Dr. H L Shashirekha is an Associate Professor in the Department of Computer
Science, Mangalore University, Mangalore, Karnataka State, India. She
completed her M.Sc. in Computer Science in 1992 and Ph.D. in 2010 from
University of Mysore. She is a member of Board of Studies and Board of
Examiners (PG) in Computer Science, Mangalore University. She has presented
several papers in International Conferences and published several papers in
International Journals and Conference Proceedings. Her area of research includes
Text Mining and Natural Language Processing.
Computer Science & Information Technology (CS & IT)
is an Associate Professor in the Department of Computer
ersity, Mangalore, Karnataka State, India. She
completed her M.Sc. in Computer Science in 1992 and Ph.D. in 2010 from
University of Mysore. She is a member of Board of Studies and Board of
Examiners (PG) in Computer Science, Mangalore University. She has presented
several papers in International Conferences and published several papers in
International Journals and Conference Proceedings. Her area of research includes
Text Mining and Natural Language Processing.

More Related Content

PDF
A Review on the Cross and Multilingual Information Retrieval
PDF
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
PDF
04. 9990 16097-1-ed (edited arf)
PDF
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
PDF
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
PDF
ATAR: Attention-based LSTM for Arabizi transliteration
PDF
Hybrid approaches for automatic vowelization of arabic texts
PDF
Summer Research Project (Anusaaraka) Report
A Review on the Cross and Multilingual Information Retrieval
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
04. 9990 16097-1-ed (edited arf)
CLIR has lot of applications, such as adhoc retrieval, text summarization, question answering, and text classification to ensure maximal accessibility to digital repository for much wider audience [4]. In addition to the challenges of conventional IR, CLIR systems possess lot of challenges related to language issues [5], such as;
  • 2. 50 Computer Science & Information Technology (CS & IT) a. Translation disambiguation, due to homonymy and polysemy [6] creates problems to find the most appropriate translation for a given word b. Lacking appropriate resources for evaluations of CLIR with low density languages c. Inflection words in the query cannot be easily located as translated root words in the dictionary, due to stemming d. New words get added to the language which may not be recognized by the existing system, resulting in out of vocabulary (OOV) and e. Most of OOV words such as technical terms and named entities in the query reduces the performance of the system According to Cardenosa et.al, [5], CLIR approaches can be categorized into three; Document translation, Query translation, and Interlingua translation. • In document translation, every document has to be translated into the query language and then retrieval will be performed using classical IR techniques. It can be applied offline to produce translations of all documents well in advance and offers the possibility to access the content in his/her own language. However, machine or (large scale) human translation may not always be a realistic option for every language pair as it is time consuming since every document needs to translated to other languages irrespective of their usage. • Query translation approach is the translation of query terms from source language to the target language. In this approach online translation can be applied to the query entered by a user and it is possible for a user to reformulate, elaborate or narrow down the translated query. Translating a query by dictionary look-up is far more efficient than translating entire document collection. However, it is unreliable since short queries do not provide enough contexts for disambiguation in choosing proper translation of query words and does not exploit domain-specific semantic constraints and corpus statistics in solving translation ambiguity. 
• In Interlingua translation approach, the source language, i.e. the text to be translated is transformed into an Interlingua, i.e., an abstract language-independent representation. The target language is then generated from the Interlingua. This approach is useful if there are no resources for a direct translation but it has lower performance than direct translation. Translation techniques in CLIR are categorized into direct and indirect translation [7]. Direct translation uses Machine Readable Dictionary (MRD), parallel corpora, and machine translation algorithm or in combination. • In Dictionary based translation the query words are translated to the target language using MRD [8]. MRDs are electronic versions of printed dictionaries, and may be general dictionaries, specific domain dictionaries, or a combination of both. It has been adopted in CLIR because bilingual dictionaries are widely available.
  • 3. Computer Science & Information Technology (CS & IT) 51 • Parallel corpora contain a set of documents and their translations in one or more other languages. These paired documents can be used to meet the most likely translations of terms between languages in the corpus. • Query translation can be implemented by using a Machine Translation (MT) system to translate documents in one languages in the corpora into the language of a user’s query which can be done offline in advance or online [9]. Indirect translation is a common solution when there is an absence of resources supporting direct translation. It can be applied by transitive or dual translation system. In case of transitive translation, the use of an intermediary (pivot) language, which is placed between the source query and the target document collection, is used to enable comparison with the target document collection. In the case of dual translation systems, both the query and the document representations are translated into the intermediate language [10]. In all the above-mentioned cases, a key element is the mechanism to map between languages. This translation knowledge can be encoded in different forms as a data structure of query and document-language term correspondences in a MRD or as an algorithm, such as a machine translation or machine transliteration system [11]. While all of these forms are effective, the latter require substantial investment of time and resources for the development and it is not widely or readily available for many language pairs. CLIR is becoming a promising field of research which bridges the gap between different languages and hence between different people speaking different languages and of different culture. As CLIR is in its infancy, many works related to many language pairs are attempted. Amharic-Arabic is one such language pairs which needs to explore for CLIR. According to the 2007 census, Amharic speakers encompass 26.9% of Ethiopia’s population. 
Amharic is also spoken by many people in Israel, Egypt and Sweden [1]. Arabic is a natural language spoken by 250 million people in 21 countries as the first language, and Islamic countries as a second language [8]. Ethiopia is one of the countries, which have more than 33.3% of the population who follow Islam, and they use Arabic language to teach religion and for communication purpose. The Arabic and Amharic languages belonging to the Semitic family of languages [12], where the words in such languages are formed by modifying the root itself internally and not simply by the concatenation of affixes to word roots. Amharic and Arabic are very rich morphology languages. The current Amharic writing system consists of a core of thirty-three characters (ፊደል, fidel) each of which occurs in a basic form and in six other forms known as orders [1]. The non-basic forms are derived from the basic forms by more-or-less regular modifications. Thus, there are 231 different characters. The seven orders represent syllable combinations consisting of consonant and following vowel. This characteristic according to Abebayehu [13], makes the Amharic writing system a syllabic writing system. A character or a symbol is used to represent a phoneme, which is a combination of a vowel and a consonant. These are written in a unique script that is now supported in Unicode (U+1200 - U+137F) [14].
  • 4. 52 Computer Science & Information Technology (CS & IT) The Arabic alphabet consists of 28 characters or 29 characters if the Hamza is considered as a separate character. It is written from right to left like Persian, Hebrew, unlike many international languages. Three of the Arabic characters appear in different shapes as follows [15][16]: • Hamza (‫)ء‬ is sometimes written :‫,ا‬ ِ‫إ‬ or ‫أ‬ (alif) • Ta marbouta (‫)ة‬ like t in English found atthe end without two dots ( o = ha) • Alifmaqsurah (‫)ى‬ is the character (‫=ي‬ya ) without dots.t The above three characters pose some difficulties in the setting up a CLIR system. Some of Arabic language resources ignore the Hamza and the dots (.) above “ta marbouta” to unite the input and output for these characters. In Arabic there is a whole series of non-alphabetic signs, added above or below the consonant letters to make the reading of the word less ambiguous. Both Arabic and Amharic languages possess translation challenges for many reasons [17][18]; such as Arabic sentences are usually long and punctuation has no or little effect on interpretation of the text. Contextual analysis is important in Arabic and Amharic in order to understand the exact meaning of some words. For example, in Amharic, the word “ገና” can have the meaning of Christmas holiday or waiting something until it happens. Characters are sometimes stretched for justified text, which hinders the exact much for same word. In Arabic, synonyms are very common. For example, “year” has three synonyms in Arabic ‫َام‬‫ع‬، ‫،حول‬ ‫سنة‬ and all are widely used in every day communication. Another challenge in Arabic is the absence of discretization (sometimes called vocalization). Discretization can be defined as a symbol over and underscored letters, which are used to indicate the proper pronunciations as well as for disambiguation purposes. 
The absence of discretization in Arabic texts poses a real challenge for Arabic natural language processing, As well as for translation, leading to high ambiguity. Though the use of discretization is extremely important for readability and understanding, they don’t appear in most printed media in Arabic regions nor on Arabic Internet web sites. They are visible in religious texts such as Quran, which is fully discretised in order to prevent misinterpretation. Ethiopia has good socio-economic relationships with Arabic countries; they are communicating using the Arabic and Amharic languages. For example, reports sent between Ethiopia and Arabic countries need to be written in both languages, and most of the new and translated religious books are written in both languages by Muslim scholars. Similar to English, a large amount of unstructured documents are available on the net in Arabic and Amharic languages. However, IR tools and techniques are mostly English language oriented, and currently there are several attempts to develop IR tools for Arabic and Amharic language. Many of Internet users who are non-native Arabic speakers can read and understand Arabic documents but they feel uncomfortable to formulate queries in Arabic. This may be either because of their limited vocabulary in Arabic, or because of the possible miss-usage of Arabic words. Different attempts have been made to develop CLIR systems for Amharic-French [19] and Afan Oromo-English [3] languages. Nevertheless, CLIR system is not found for Amharic-Arabic language pair. Development of standard corpus and tools is very essential in order to test the performance of the newly developed CLIR system [20]..
  • 5. Computer Science & Information Technology (CS & IT) 53 The aim of this research work is to develop a prototype of dictionary based Amharic-Arabic CLIR system that enables Amharic and Arabic language users to retrieve both language documents and to examine the ability of the proposed system. We employee query translation strategy, which is more efficient than document translation strategy, because the document translation strategy require overhead cost of translating all documents, especially when new documents are added frequently and not all of the documents are of interest to the users [21]. The remainder of this paper is organized as follows; the review of related works is presented in Section 2 and the proposed CLIR method in Section 3. Section 4 gives the experimental setup and the results and the paper conclude in Section 5. 2. RELATED WORKS Several researchers have studied CLIR works related to different language pairs. However, less work is reported on Amharic and Arabic languages paired with other languages. Some of the prominent works are discussed below Argaw Atelach Alemu, et.al [19], present a dictionary based approach to translate the Amharic queries into French Bags-of-words in the Amharic-French bilingual track at CLEF 2005 using the search engines: SICS and Lucene. Non-content bearing words were removed both before and after the dictionary lookup. TF/IDF values supplemented by a heuristic function was used to remove the stop words from the Amharic queries and two French stop words lists were used to remove stop words from French translations. From the experiments, they found that the SICS search engine performed better than Lucene. Aljlayl et.al [1], empirically evaluated the use of an MT-based approach for query translation in an Arabic-English CLIR system using TREC-7 and TREC-9 topics and collections. 
The effect of query length on the performance of MT is also investigated to explore how much context is actually required for successful MT processing. A well-formed source query makes the MT system able to provide its best accuracy. Tesfaye Fasika [20], employed a corpus based approach which makes use of phrasal query translation for Amharic-English CLIR. The result of the experimentation is a recall value of 24.8% for translated Amharic queries, 46.3% for Amharic queries and 43.6% for the baseline English queries. Nigussie Eyob [7], have developed a corpus based Afaan Oromo–Amharic CLIR system to enable Afaan Oromo speakers to retrieve Amharic information using Afaan Oromo queries. Documents including news articles, bible, legal documents and proclamations from customs authority were used as parallel corpus. Two experiments were conducted, by allowing only one possible translation to each Afaan Oromo query term and by allowing all possible translations. The first experiment returned a maximum average precision of 81% and 45% for monolingual (Afaan Oromo) queries and bilingual (translated Amharic) queries run respectively. The second experiment showed better result of recall and precision than the first experiment, which is 60% for the bilingual query run, and the result for the monolingual query run remained the same. Mequannint et al. [22], designed a model for an Amharic-English Search Engine and developed a bilingual Web search engine based on the model that enables Web users for finding the information they need in Amharic and English languages. They have identified different language dependent query pre-processing components for query translation and developed a bidirectional dictionary-based translation system, which incorporates a transliteration component to handle proper names, which are often missing in bilingual lexicons. They used an Amharic search engine and an open source English search engine (Nutch) for Web document crawling, indexing,
  • 6. 54 Computer Science & Information Technology (CS & IT) searching, ranking and retrieving. The experimental results showed that the Amharic-English Cross-Lingual Retrieval engine performed 74.12% of its corresponding English monolingual retrieval engine and the English-Amharic Cross-Lingual Retrieval engine performed 78.82% of its corresponding Amharic monolingual retrieval engine. In CLIR, the semantic level of words is crucial. Solving the problem of word sense disambiguation will enhance the effectiveness of CLIR systems. Andres Duque et al [23], studied to choose the best dictionary for Cross Lingual Word Sense Disambiguation (CLWSD). They applied the comparison between different dictionaries in two different frameworks; analysing the potential results of an ideal system using those dictionaries and considering the particular unsupervised CLWSD system Co-occurrence Graph, then analyse the results obtained when using different bilingual dictionaries providing the potential translations. They also developed hybrid system by combining the results provided by a probabilistic dictionary, and those obtained with a Most Frequent Sense (MFS) approach. They have focused on only on English- Spanish cross- lingual disambiguation. The hybrid approach outperforms the results obtained by other unsupervised systems. As Arabic is a relatively widely researched Semitic language and has a number of common properties that share with Amharic, some of the computational linguistic research [1],[19],[24], conducted on Amharic and Arabic languages nowadays recommended customizing and using the tools developed for these languages. While the above researchers has attempted to develop and evaluate Amharic and Arabic paired languages with other languages separately, no research has these two languages paired together. 3. METHODOLOGY In this work, an attempt has been made to design a dictionary based Amharic-Arabic CLIR system, which has indexing and searching tasks. 
Inverted file indexing structure is used to organize documents to speed up searching. The probabilistic model that attempts to simulate the uncertainty nature of an IR system guides the searching process. Amharic and Arabic documents are pre-processed separately by performing tokenization, normalization, stop word removal, punctuation removal and stemming. Figure 3.1 shows the general architecture of the system, which is adopted from C. Peters et al [25]. Bi-lingual dictionary, which includes the list of Amharic and Arabic translated words is constructed manually and is used to translate Amharic queries to Arabic queries. Binary independent probabilistic information retrieval model is adopted to search the relevant documents from Amharic-Arabic parallel corpus. Probabilistic information retrieval is the estimation of the probability of relevance that a document di will be judged relevant by the user with respect to query q, which is expressed as, P(R|q, di), where, R is the set of relevant documents. Typically, in probabilistic model, based on the query the documents are divided into relevant and irrelevant documents [26]. However, the probability of any document is relevant or irrelevant with respect to users query is initially unknown. Therefore, the probabilistic model needs to guess the relevance at the beginning of search process. The user then observes the first retrieved documents and gives feedback for the system by selecting relevant documents as relevant and irrelevant documents as irrelevant. By collecting relevance feedback data from a few documents, the model can then be applied to estimate the probability of relevance for the remaining documents in the collection. This process is applied iteratively to improve the
  • 7. Computer Science & Information Technology (CS & IT) 55 performance of the system to retrieve more and more relevant documents, which satisfies the users need. Figure 3.1 Dictionary based Amharic-Arabic CLIR system architecture
  • 8. 56 Computer Science & Information Technology (CS & IT) The assumptions made for the uncertainty nature of probability model are; • p(ki|R) is constant for all index terms k (usually, its equal to 0.5) • The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions will give; where, N is the total number of documents in the collection and ni is the number of documents which contain the index term ki. 4. EXPERIMENTATION AND EVALUATION The Holy Quran available through Tanzile Quran navigator website [27] includes 114 chapters, each containing a minimum of 3 to a maximum of 286 verses in Arabic Amharic languages. In this work, subject to the availability of the number of verses, we have downloaded upto 10 verses from each chapter in Arabic and the corresponding verses in Amharic. Even though complete evaluation process requires the evaluation of both system effectiveness and efficiency, only effectiveness of IR system is taken into consideration to determine the performance of the system for the translated queries. Precision and recall are used to measure the effectiveness of the IR system designed. We used Amharic queries for the retrieval of documents both in Arabic and Amharic languages. In addition to retrieving Amharic documents, the Amharic query is translated into Arabic for retrieving Arabic documents. We used 14 simple queries to test the performance of the system and the results obtained are shown in Table 4.1. The performance of the system on Arabic relevant retrieved documents is much better than that of Amharic documents (i.e., 83.89% precision for Amharic against 52.02% precision for Arabic). When the system is tested by giving queries that has Out of Vocabulary words in the dictionary, its precision is decreased and recall is increased specially for Arabic documents. 
For example, if we add a word “ለኾነው” (to become) which is not translated correctly or appeared in the dictionary for the first query “የፍርዱ ቀን ባለቤት ለኾነው” (Financed you day of the debt) the word “ለኾነው” (to become) is directly used for searching. Therefore, the number of Amharic non relevant documents increased by highly decreasing the performance of the system. The main hindrance of the system performance is incorrect translation due to unnormalized Arabic words specifically diacritics for mapped with the dictionary words, system that cannot be.
  • 9. Computer Science & Information Technology (CS & IT) 57 Table 4.1 Performance of the proposed system 5. CONCLUSION Multilingual information is required for the countries that have multiple languages and it is vital as the users of the internet throughout the world are ever increasing. We have developed a prototype of dictionary based Amharic-Arabic CLIR system that enables Amharic and Arabic language users to retrieve both language documents and to examine the ability of the proposed system. The effectiveness of our proposed system was evaluated and the performance of the system on Arabic relevant retrieved documents was much better than that of Amharic documents.
The main challenges with dictionary-based CLIR are untranslatable words caused by the limited coverage of the general Amharic-Arabic dictionary, the processing of inflected words, phrase identification and translation, and lexical ambiguity in both Amharic and Arabic. Although this research is significant for retrieving the required information from Amharic-Arabic documents, some issues need further investigation before an efficient and effective CLIR system can be developed. In particular, the approach requires an exhaustive and detailed mapping of concepts between the two languages, which is very difficult to build.
REFERENCES
[1] M. Aljlayl, O. Frieder, and D. Grossman, “On Arabic-English cross-language information retrieval: A machine translation approach,” in Information Technology: Coding and Computing, 2002. Proceedings. International Conference on, 2002, pp. 2–7.
[2] K. Sourabh, “An Extensive Literature Review on CLIR and MT activities in India,” Int. J. Sci. Eng. Res., 2013.
[3] D. Bekele, “Afaan Oromo Oromo-English Cross-Lingual Information Retrieval (CLIR),” AAU, 2011.
[4] D. Kelly, “Methods for evaluating interactive information retrieval systems with users,” Found. Trends Inf. Retr., vol. 3, no. 1–2, pp. 1–224, 2009.
[5] J. Cardeñosa, C. Gallardo, and A. Toni, “Multilingual Cross Language Information Retrieval A new approach.”
[6] M. Abusalah, J. Tait, and M. Oakes, “Literature Review of Cross Language Information Retrieval,” Comput. Hum., pp. 175–177, 2005.
[7] E. Nigussie, “Afaan Oromo-Amharic Cross Lingual Information Retrieval,” AAU, 2013.
[8] T. Hedlund, “Dictionary-based cross-language information retrieval: principles, system design and evaluation,” in SIGIR Forum, 2004, vol. 38, no. 1, p. 76.
[9] M. R. Warrier and M. S. S. Govilkar, “A Survey on Various CLIR Techniques.”
[10] D. Zhou, M. Truran, T. Brailsford, V. Wade, and H. Ashman, “Translation techniques in cross-language information retrieval,” ACM Comput. Surv., vol. 45, no. 1, p. 1, 2012.
[11] G.-A. Levow, D. W. Oard, and P. Resnik, “Dictionary-based techniques for cross-language information retrieval,” Inf. Process. Manag., vol. 41, no. 3, pp. 523–547, 2005.
[12] A. D. Rubin, “The Subgrouping of the Semitic Languages,” Linguist. Lang. Compass, vol. 2, no. 1, pp. 79–102, 2008.
[13] S. Abebayehu, “Amharic-English Script Identification in Real-Life Document Images,” AAU, 2012.
[14] B. Ayalew, “The submorphemic structure of Amharic: toward a phonosemantic analysis,” University of Illinois at Urbana-Champaign, 2013.
[15] R. Tsarfaty, “Syntax and Parsing of Semitic Languages,” in Natural Language Processing of Semitic Languages, Springer, 2014, pp. 67–128.
[16] H. Ishkewy, H. Harb, and H. Farahat, “Azhary: An Arabic lexical ontology,” arXiv preprint arXiv:1411.1999, 2014.
[17] T. Hailemeskel, “Amharic Text Retrieval: An Experiment Using Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD),” M.Sc. Thesis, Addis Ababa University, Addis Ababa, 2003.
[18] F. Ahmed and A. Nürnberger, “Arabic/English word translation disambiguation approach based on naïve Bayesian classifier,” in Computer Science and Information Technology, 2008. IMCSIT 2008. International Multiconference on, 2008, pp. 331–338.
[19] A. A. Argaw, L. Asker, J. Karlgren, M. Sahlgren, and R. Cöster, “Dictionary-based Amharic-French information retrieval,” CEUR Workshop Proc., vol. 1171, 2005.
[20] F. Tesfaye, “Phrasal Translation for Amharic English Cross Language Information Retrieval (CLIR),” AAU, 2010.
[21] M. Adriani, “Using statistical term similarity for sense disambiguation in cross-language information retrieval,” Inf. Retr. Boston., vol. 2, no. 1, pp. 71–82, 2000.
[22] M. Munye and S. Atnafu, “Amharic-English bilingual web search engine,” in Proceedings of the International Conference on Management of Emergent Digital EcoSystems, 2012, pp. 32–39.
[23] A. Duque, J. Martinez-Romo, and L. Araujo, “Choosing the best dictionary for Cross-Lingual Word Sense Disambiguation,” Knowledge-Based Syst., vol. 81, pp. 65–75, 2015.
[24] S. A. L. S. F. Adafre, “Machine Translation for Amharic: Where we are,” Strateg. Dev. Mach. Transl. Minor. Lang., p. 47.
[25] C. Peters, M. Braschler, and P. Clough, Multilingual information retrieval: From research to practice. Springer Science & Business Media, 2012.
[26] F. Dahak, M. Boughanem, and A. Balla, “A probabilistic model to exploit user expectations in XML information retrieval,” Inf. Process. Manag., 2016.
[27] http://tanzil.net/#trans/am.sadiq
AUTHORS
Ibrahim Gashaw Kassa has been a Ph.D. candidate at Mangalore University, Karnataka State, India, since 2016. He graduated in 2006 in Information Systems from Addis Ababa University, Ethiopia. In 2014, he obtained his master’s degree in Information Technology from the University of Gondar, Ethiopia. He served as a lecturer at the University of Gondar from 2009 to May 2016. His research interest is Cross Language Information Retrieval.
Dr. H L Shashirekha is an Associate Professor in the Department of Computer Science, Mangalore University, Mangalore, Karnataka State, India. She completed her M.Sc. in Computer Science in 1992 and her Ph.D. in 2010 at the University of Mysore. She is a member of the Board of Studies and the Board of Examiners (PG) in Computer Science, Mangalore University. She has presented several papers at international conferences and published several papers in international journals and conference proceedings. Her areas of research include Text Mining and Natural Language Processing.