International Journal of Technical Research and Applications e-ISSN: 2320-8163,
www.ijtra.com Volume 3, Issue 2 (Mar-Apr 2015), PP. 152-156
A SURVEY ON VARIOUS CLIR TECHNIQUES
Miss Rekha Warrier, Mrs Sharvari S. Govilkar
Department of Computer Engineering,
Pillai Institute of Information &Technology,
Navi Mumbai, India
rekhawarrier827@gmail.com, g_sharvari@rediffmail.com
Abstract- An information retrieval (IR) system aims to retrieve
documents relevant to a user query, where the query is a set of
keywords. Cross-language information retrieval (CLIR) is a
retrieval process in which the user issues queries in one language to
retrieve information in another language. The growing need of
Internet users to access information expressed in languages other
than their own has established Cross Language Information
Retrieval (CLIR) as a major topic in IR.
Index Terms— Cross Language Information Retrieval, Ontology,
multilingual, dictionary-based translation, corpora, machine
translation
I. INTRODUCTION
The area of information access has evolved to cover many
sophisticated tasks, such as information retrieval, question
answering, summarization, multimedia information retrieval,
text mining and clustering, and Web information retrieval.
Cross-language Information Retrieval (CLIR) can be described
at an abstract level as the task of retrieving documents across
languages: queries are issued in one language and documents
are retrieved in one or more different languages.
The multilingual variants of IR are:-
1) BLIR (Bi-Lingual Information Retrieval)
2) CLIR (Cross-Lingual Information Retrieval) and
3) MLIR (Multi-Lingual Information Retrieval).
The ability to search and retrieve information in multiple
languages is becoming increasingly important and challenging
in today’s environment. Consequently, multilingual and
cross-lingual information retrieval (MLIR and CLIR) search
engines have received more research attention and are
increasingly being used to retrieve information on the Internet.
CLIR refers to searching, translating and retrieving information
in different languages, but mainly between a source language
and a target language.
The paper presents a detailed survey of various CLIR
techniques and the advantages and limitations of each method.
Section II introduces CLIR and its classification by research
field. Section III reviews the past literature. Section IV
discusses the types of CLIR techniques and compares them.
Finally, section V concludes the paper.
II. CROSS-LINGUAL INFORMATION RETRIEVAL
Cross-language information retrieval (CLIR) is a retrieval
process in which the user presents queries in one language to
retrieve information in another language. CLIR approaches fall
into two research fields: the first is the dictionary-based
approach (bilingual MRD and machine translation (MT)),
and the second is the concept-driven approach.
In dictionary-based query translation, the query keywords are
translated into the target language using Machine Readable
Dictionaries (MRD). MRDs are electronic versions of printed
dictionaries, and may be general dictionaries, domain-specific
dictionaries, or a combination of both. The major problem in the
bilingual dictionary approach is translation ambiguity, in
addition to the problems of word inflection and of translating
word compounds, phrases, proper names, spelling variants and
special terms. MT systems normally attempt to determine the
correct word sense for translation by using context analysis.
Concept-driven approaches such as thesauri and multilingual
ontologies bridge the gap between a linguistic term and its
meaning. A bilingual thesaurus groups words with similar
meanings in hierarchies (with several levels) of classes and
sections and maps them according to their meanings. However,
a thesaurus does not include the definitions of words.
An ontology is a generalized collection of knowledge that can be
used to add context to search queries through query expansion,
enabling word sense disambiguation. The paper therefore
emphasizes the ontology-based CLIR approach over the
thesaurus-based one.
III. LITERATURE SURVEY
Mustafa Abusalah et al. [1] report an experiment to evaluate
a Cross Language Information Retrieval (CLIR) system that
uses an ontology to improve query translation in the travel
domain. The ontology-based approach significantly
outperformed the Machine Readable Dictionary translation
baseline, using Mean Average Precision as the metric in a
user-centered experiment.
Fredric C. Gey and nine researchers [2] proposed a method
that can potentially acquire all the parallel texts in a web site
using a parallel-corpora-based cross-lingual information
retrieval (CLIR) technique.
Ari Pirkola, Turid Hedlund et al. [3] reviewed the literature on
dictionary-based cross-language information retrieval (CLIR)
and presented CLIR research done at the University of Tampere
(UTA). They present the structured query model and report
findings for four different language pairs concerning the
effectiveness of query structuring. Pattabhi R. K. Rao and
Sobha L. [4] described how cross-lingual information retrieval
can be done effectively between Tamil, a highly agglutinative
language, and English, an isolating language. The query
needs to be processed using a morphological analyzer or a
stemmer to obtain the base forms of the given query terms.
Feng Yu, Dequan Zheng et al. [5] presented a method based on
domain ontology knowledge to improve the effectiveness of
Chinese-English (C-E) cross-lingual information retrieval
(CLIR). In this study, the domain ontology knowledge is
acquired from both source-language user queries and target
documents, and is used to select target translations and to
re-rank the initially retrieved document set. Manoj Kumar
Chinnakotla et al. [6] discussed a query translation based
approach using bilingual dictionaries. Query words not found
in the dictionary are transliterated using a simple rule-based
transliteration approach. Using this approach, for Hindi,
a Mean Average Precision (MAP) of 0.2366 using the title and a
MAP of 0.2952 using the title and description are achieved. For
Marathi, a MAP of 0.2163 using the title is achieved.
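The MAP figures quoted above follow the standard definition: for each query, average precision (AP) is the mean of the precision values at the ranks where relevant documents appear, and MAP is the mean of AP over all queries. A minimal sketch of that computation is given below. The document identifiers, rankings and relevance judgments are hypothetical and are not taken from any of the cited experiments.

```python
from typing import Dict, List, Set

def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    # Dividing by the number of relevant documents penalizes relevant documents never retrieved.
    return precision_sum / len(relevant)

def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """MAP: the mean of the per-query AP values."""
    if not runs:
        return 0.0
    total = sum(average_precision(ranked, qrels.get(query_id, set()))
                for query_id, ranked in runs.items())
    return total / len(runs)

# Hypothetical ranked results and relevance judgments for two queries.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 4))
```

For these toy inputs, q1 has AP = (1/1 + 2/3) / 2 and q2 has AP = 1/2, so MAP is their mean.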
Sujoy Das et al. [7] observed that the dictionary-based
query translation approach has been widely used by
researchers of CLIR. Translation ambiguity and target
polysemy are the two major problems of dictionary-based
CLIR. In their paper, the researchers investigated
part-of-speech and co-occurrence based disambiguation
techniques for an English-Hindi CLIR system. Dinesh Mavaluru
and Dr. R. Shriram [8] proposed a hybrid model for a
Telugu-English CLIR system. A bilingual ontology is used to
convert each Telugu word to the English word. The overall
system was implemented in Java and the ontology was built for
the Telugu language. The system was tested for accuracy. The
Google search interface was used.
Saurabh Varshney and Jyoti Bajpai [9] studied the
effect of target polysemy and translation ambiguity in the
dictionary-based query translation approach for an English-Hindi
CLIR system. Nurjannaton Hidayah Rais et al. [10] discuss
research on query translation events in a Malay-English
Cross-Language Information Retrieval (CLIR) system. The
researchers assume that by improving query translation
accuracy, they can improve information retrieval
performance.
Debasis Mandal et al. [11] describe experiments on two
cross-lingual text retrieval tasks and one monolingual English
text retrieval task. A careful analysis of the queries revealed
that queries with named entities gave better results in all the
runs, whereas queries without named entities performed very
poorly because of poor bilingual lexicons, bringing down the
overall performance metrics. S. M. Chaware and Srikanth Rao [12]
discuss an approach to building an ontology from a relational
database with some additional rules. The ontology can be built
dynamically as per the user’s need, which gives the user
knowledge of the overall domain. The result shows a complete,
easy and simple way of building an ontology from a database.
Mustafa Abusalah et al. [13] presented a short literature
review of CLIR systems. They described previous work in CLIR
and current problems in CLIR, and made
recommendations for future work.
IV. TYPES OF CLIR TECHNIQUES
There is an increasing amount of full text material in various
languages available through the Internet and other information
suppliers. Therefore cross-language information retrieval
(CLIR) has become an important new research area. Some
CLIR systems use language resources such as bilingual
dictionaries to translate the user’s original query, while other
systems use machine translation to translate the foreign-
language documents beforehand, enabling them to be retrieved
by the original query.
Based on the first research field which is “Bilingual MRD
and MT”, CLIR system is divided into:-
1] Query Translation
2] Document Translation
Based on the second research field, which is “Concept based field”,
CLIR system has two approaches:-
1] Multilingual dictionary (Dictionary based).
2] Ontology
CLIR techniques can be classified into different categories
based on translation resources:
1] Dictionary-based CLIR technique (DB-CLIR)
2] Corpora based CLIR technique
3] Machine translator based CLIR technique
4] Ontology based CLIR technique (Concept driven field)
1) Dictionary-based CLIR technique
In dictionary-based query translation, the query keywords
are translated into the target language using Machine Readable
Dictionaries (MRD). DB-CLIR is applied using document
translation as well as query translation.
i. DB-CLIR using document translation
There are two main strategies in DB-CLIR:-
i) translating the original documents into the language of the
queries,
ii) translating the queries into the language of the documents.
This family of approaches includes all techniques which rely on a
simple machine-readable bilingual dictionary to map the bag of
words query derived from the user request to a semantically
equivalent bag of words representation in the document language.
Dictionary-based translation is a traditional approach in
cross-lingual IR systems, but significant performance degradation
is observed when queries contain words or phrases that do not
appear in the dictionary.
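The bag-of-words mapping described above can be sketched as a simple dictionary lookup. The example below is only an illustration under stated assumptions: the tiny romanized Hindi to English dictionary and the function name `translate_query` are hypothetical, and out-of-vocabulary (OOV) words are passed through unchanged, which is one common fallback rather than a method prescribed by the surveyed papers.

```python
from typing import Dict, List

# Hypothetical machine-readable bilingual dictionary (romanized Hindi -> English).
# A real MRD would hold many thousands of entries.
BILINGUAL_DICT: Dict[str, List[str]] = {
    "nadi": ["river"],
    "kinara": ["bank", "shore", "edge"],
}

def translate_query(terms: List[str],
                    dictionary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each source term to all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged, so proper names
    can still match documents in the target language.
    """
    translated: Dict[str, List[str]] = {}
    for term in terms:
        translated[term] = dictionary.get(term.lower(), [term])
    return translated

query = ["nadi", "kinara", "Ganga"]
result = translate_query(query, BILINGUAL_DICT)
# Terms with no dictionary entry are the ones that degrade retrieval performance.
oov_terms = [t for t in query if t.lower() not in BILINGUAL_DICT]
print(result)
print(oov_terms)
```

Note that "kinara" yields several candidate translations, which is the translation ambiguity problem, while "Ganga" is an OOV proper name of the kind that causes the performance degradation mentioned above.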
ii. DB-CLIR using query translation
The document translation approach requires that every
document in the collection be translated into the language of
the user request. The approach may require enormous
translation effort and will be expensive. In the query translation
approach, the query is translated into the language of the
documents and then monolingual retrieval is performed. The
query can be translated using a machine translation system,
parallel texts and/or domain-specific corpora, or a Machine
Readable Dictionary (MRD). The query translation approach is
popular in the CLIR community because it is efficient and
easily implemented for relatively short queries.
2) Corpora based CLIR technique
A Corpus is a repository of a collection of natural language
material, such as text, paragraphs, and sentences from one or
many languages. Two types of corpora (plural of “corpus”)
have been used in query translation:-
• Parallel corpora
• Comparable corpora
i. Parallel corpora
Parallel corpora consist of the same text in more than one
language. When retrieving text from a parallel corpus, the query
does not need to be translated, since a source-language
query can be matched against the source-language component
of the corpus, and the target-language component aligned
to it can then be easily retrieved.
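A minimal sketch of this idea follows. It assumes a sentence-aligned English-French corpus held in memory and uses plain term overlap as the matching function. The corpus contents and the function name `retrieve_aligned` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical sentence-aligned parallel corpus: (source text, target text).
PARALLEL_CORPUS: List[Tuple[str, str]] = [
    ("the river is wide", "le fleuve est large"),
    ("the bank is closed", "la banque est fermée"),
    ("the weather is cold", "il fait froid"),
]

def retrieve_aligned(query: str,
                     corpus: List[Tuple[str, str]],
                     top_k: int = 1) -> List[str]:
    """Match the untranslated query against the source side of the corpus
    and return the target-side texts aligned to the best matches."""
    query_terms = set(query.lower().split())
    scored = []
    for source_text, target_text in corpus:
        overlap = len(query_terms & set(source_text.lower().split()))
        if overlap > 0:
            scored.append((overlap, target_text))
    # Highest overlap first; Python's sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [target_text for _, target_text in scored[:top_k]]

hits = retrieve_aligned("bank closed", PARALLEL_CORPUS)
print(hits)
```

The query is never translated: the alignment between the two sides of the corpus does the cross-language step.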
ii. Comparable Corpora
Comparable corpora contain text in more than one language.
The texts in each language are not translations of each other,
but cover the same topic area, and hence contain an equivalent
vocabulary. A number of statistical techniques can be used to
derive topic-specific (often technical) bilingual dictionaries
from parallel corpora.
The corpora-based CLIR technique mainly consists of four
modules: preprocessing, candidate text retrieval, parallel text
verification, and duplicate elimination.
i) First, web pages are fed into the preprocessing module;
the texts in each page are extracted and saved with the page’s URL
and each text’s relative location in the page.
ii) The candidate text retrieval module then builds an index of
the English texts and, using a cross-lingual information
retrieval model, retrieves the English texts most similar
to each text in the other language; the retrieval results will
contain the wanted parallel texts.
iii) In the next step, the parallel text verification module selects
the truly parallel texts from the candidate texts. Finally, the
duplicate elimination module checks for duplicate texts and
removes the excess ones.
Results returned by this type of cross-lingual information
retrieval technique contain a considerable amount of parallel
texts, with over 90% precision.
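The four modules can be sketched as a small pipeline. This is an illustration under several assumptions rather than the surveyed system: candidate retrieval uses term overlap after a word-by-word lookup in a hypothetical lexicon, and verification uses a simple length-ratio check as a stand-in for the actual verification algorithm. All function names, URLs and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Text = Tuple[str, int, str]  # (page URL, position in page, extracted text)
Pair = Tuple[Text, Text]     # (foreign-language text, English text)

def preprocess(pages: Dict[str, List[str]]) -> List[Text]:
    """Module 1: extract texts, keeping the page URL and relative position."""
    return [(url, pos, text.strip())
            for url, texts in pages.items()
            for pos, text in enumerate(texts) if text.strip()]

def overlap(terms: Set[str], text: str) -> int:
    return len(terms & set(text.lower().split()))

def retrieve_candidates(foreign: List[Text], english: List[Text],
                        lexicon: Dict[str, str]) -> List[Pair]:
    """Module 2: pair each foreign text with its most similar English text."""
    pairs = []
    for f in foreign:
        # Assumed CLIR model: translate word by word, then count shared terms.
        terms = {lexicon.get(w, w) for w in f[2].lower().split()}
        best = max(english, key=lambda e: overlap(terms, e[2]), default=None)
        if best is not None and overlap(terms, best[2]) > 0:
            pairs.append((f, best))
    return pairs

def verify(pairs: List[Pair], max_ratio: float = 2.0) -> List[Pair]:
    """Module 3 (stand-in check): keep pairs with comparable word counts."""
    kept = []
    for f, e in pairs:
        len_f, len_e = len(f[2].split()), len(e[2].split())
        if max(len_f, len_e) / min(len_f, len_e) <= max_ratio:
            kept.append((f, e))
    return kept

def deduplicate(pairs: List[Pair]) -> List[Pair]:
    """Module 4: drop pairs whose text content has already been seen."""
    seen, unique = set(), []
    for f, e in pairs:
        key = (f[2], e[2])
        if key not in seen:
            seen.add(key)
            unique.append((f, e))
    return unique

english = preprocess({"http://example.com/en": ["the river is wide",
                                                "the bank is closed"]})
foreign = preprocess({"http://example.com/fr": ["le fleuve est large",
                                                "le fleuve est large "]})
lexicon = {"le": "the", "fleuve": "river", "est": "is", "large": "wide"}
parallel = deduplicate(verify(retrieve_candidates(foreign, english, lexicon)))
print(parallel)
```

In this toy run the French text appears twice on the page, so duplicate elimination leaves a single verified pair.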
3) Machine Translator based CLIR technique
In CLIR, Machine Translation (MT) can be implemented
in two different ways. The first way is to use an MT system to
translate foreign language documents in the corpora into the
language of the user’s query. This is done off-line beforehand.
This approach is not viable for large document collections, or
for collections in which the documents are in numerous
languages.
In the second method of using MT in CLIR, the user’s
query in the “source” language is translated into the “target”
language (the language of the documents in the stored
collection). The “target” language query is then used to retrieve
“target” language documents using classical IR techniques.
With both methods, the MT stage is separate from the retrieval
stage. An ambiguity problem exists in the MT component,
since the translated query does not necessarily represent the
sense of the original query. For instance, translating the
English query “big bank” into another language could produce an
inappropriate translation, since it is not clear whether “bank”
means the institution or the edge of a river. MT systems
normally attempt to determine the correct word sense for
translation by using context analysis. However, a typical search
engine query lacks context, as it consists of a small number of
keywords. MT is more effective for document translation, as the
context is clearer.
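The role of context can be illustrated with a toy co-occurrence based disambiguation, in the spirit of the co-occurrence techniques mentioned in the literature survey. The candidate translations, the small French corpus and the function names below are all hypothetical, and this is only a sketch of the general idea, not the method of any cited system.

```python
from typing import Dict, List

# Hypothetical English -> French translation candidates; "bank" is ambiguous.
CANDIDATES: Dict[str, List[str]] = {
    "river": ["fleuve"],
    "bank": ["banque", "rive"],
}

# Hypothetical target-language sentences used to count co-occurrences.
TARGET_CORPUS: List[str] = [
    "la rive du fleuve est verte",
    "le fleuve quitte sa rive",
    "la banque ouvre un compte",
]

def cooccurrence(a: str, b: str, corpus: List[str]) -> int:
    """Number of sentences in which both words occur."""
    return sum(1 for s in corpus if a in s.split() and b in s.split())

def disambiguate(query_terms: List[str],
                 candidates: Dict[str, List[str]],
                 corpus: List[str]) -> Dict[str, str]:
    """Pick, for each term, the candidate that co-occurs most often with the
    candidate translations of the other query terms."""
    chosen: Dict[str, str] = {}
    for term in query_terms:
        options = candidates.get(term, [term])
        context = [c for other in query_terms if other != term
                   for c in candidates.get(other, [other])]
        chosen[term] = max(
            options,
            key=lambda o: sum(cooccurrence(o, c, corpus) for c in context))
    return chosen

choice = disambiguate(["river", "bank"], CANDIDATES, TARGET_CORPUS)
print(choice)
```

With the context word "river" the sketch selects "rive"; for the single-keyword query "bank" there is no context, every candidate scores zero, and the first listed candidate is returned, which mirrors the lack-of-context problem described above.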
Compared with dictionary or corpus based methods, the
advantage of MT-based CLIR translation is that the
technologies integrated in MT systems, such as syntactic and
semantic analysis, can help to improve translation
accuracy.
4) Ontology based CLIR technique
This type of technique comes under the field of
concept-driven research. Concept-driven approaches such as
thesauri (multilingual dictionaries) and multilingual ontologies
bridge the gap between a linguistic term and its meaning. An
ontology is a hierarchically structured set of terms for
describing a domain that can be used as a skeletal foundation
for a knowledge base. An ontology, together with a set of
instances of the classes or concepts defined, constitutes a
knowledge base about the domain being described. The
technique is as follows:-
i) The ontology-based IR system submits the query keywords to
XSL (Extensible Stylesheet Language) to query the ontologies,
extracting related concepts and concept relations.
ii) The ontology-based CLIR system then examines the concepts
associated through semantic relations.
iii) Where synonyms are found, they are selected for query
expansion; all of this is done monolingually. The concepts are then
translated into their equivalent concepts in the other language
using the ontology's bilingual index.
iv) If a concept is not found in the ontology, the dictionary is
used to find the relevant translated concepts.
v) The final translated query terms are combined using
Boolean OR and then matched against the corpus documents.
vi) The results are then ranked on factors such as the
number of matching terms found in each document
and the number of terms occurring in the document.
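The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the ontology fragment, bilingual index, fallback dictionary and French documents are hypothetical, plain Python dictionaries stand in for the XML ontology and XSL queries, and ranking uses only the number of matching terms.

```python
from typing import Dict, List, Tuple

# Hypothetical travel-domain ontology fragment: concept -> synonyms.
SYNONYMS: Dict[str, List[str]] = {"hotel": ["inn"], "flight": []}
# Hypothetical ontology bilingual index (English concept -> French concept).
BILINGUAL_INDEX: Dict[str, str] = {"hotel": "hôtel", "inn": "auberge"}
# Hypothetical fallback dictionary for concepts missing from the ontology.
DICTIONARY: Dict[str, str] = {"flight": "vol"}

def expand(terms: List[str]) -> List[str]:
    """Step iii (monolingual): add synonyms of each query concept."""
    expanded: List[str] = []
    for term in terms:
        for concept in [term] + SYNONYMS.get(term, []):
            if concept not in expanded:
                expanded.append(concept)
    return expanded

def translate(concepts: List[str]) -> List[str]:
    """Steps iii-iv: use the bilingual index, else fall back to the dictionary."""
    translated: List[str] = []
    for concept in concepts:
        target = BILINGUAL_INDEX.get(concept) or DICTIONARY.get(concept)
        if target and target not in translated:
            translated.append(target)
    return translated

def rank(query_terms: List[str], docs: Dict[str, str]) -> List[Tuple[str, int]]:
    """Steps v-vi: Boolean OR match, ranked by number of matching terms."""
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        matched = sum(1 for q in query_terms if q in tokens)
        if matched > 0:  # OR semantics: one matching term is enough
            scored.append((doc_id, matched))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": "un hôtel près de la plage",
    "d2": "vol direct et hôtel ou auberge",
    "d3": "recette de cuisine",
}
target_query = translate(expand(["hotel", "flight"]))
ranking = rank(target_query, docs)
print(target_query)
print(ranking)
```

Here "flight" is absent from the bilingual index, so it is translated through the fallback dictionary, as in step iv.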
In this technique, the ontology is represented in XML form so
that concepts, sub-concepts and their relationships can be
mapped easily. Being able to identify the most appropriate
translations of ontology concepts is crucial in the
ontology translation phase.
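As an illustration of why an XML representation makes concepts and sub-concepts easy to map, the sketch below parses a small ontology fragment with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`ontology`, `concept`, `synonym`, `id`, `label`) are a hypothetical schema chosen for this example, not the schema used by the surveyed systems, and Python's limited XPath support is used here in place of XSL.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical travel-domain ontology; nesting expresses sub-concepts.
ONTOLOGY_XML = """
<ontology domain="travel">
  <concept id="accommodation" label="accommodation">
    <concept id="hotel" label="hotel">
      <synonym>inn</synonym>
    </concept>
    <concept id="hostel" label="hostel"/>
  </concept>
</ontology>
"""

root = ET.fromstring(ONTOLOGY_XML)

def sub_concepts(concept_id: str) -> List[str]:
    """Labels of the direct sub-concepts of the given concept."""
    node = root.find(f".//concept[@id='{concept_id}']")
    if node is None:
        return []
    return [child.get("label") for child in node.findall("concept")]

def synonyms(concept_id: str) -> List[str]:
    """Synonyms recorded directly under the given concept."""
    node = root.find(f".//concept[@id='{concept_id}']")
    if node is None:
        return []
    return [s.text for s in node.findall("synonym")]

print(sub_concepts("accommodation"))
print(synonyms("hotel"))
```

Because the hierarchy is carried by element nesting, extending the ontology with a new sub-concept only requires adding one nested element.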
Simple approaches have been developed for CLIR by using a
multilingual dictionary or WordNet. An ontology is a better
choice for CLIR, as it covers the entire context and its
relationships, which is helpful for both the user and the system
provider. Without an ontology, acquiring knowledge, even for a
small activity, requires accessing the entire database every
time. This degrades performance in terms of results and is also
time consuming. An ontology avoids this and improves
performance: whenever knowledge is required, only the data
about the relevant sub-domain needs to be considered, and the
ontology gives accurate knowledge.
As defined above, an ontology is a hierarchically structured set
of terms for describing a domain that can be used as a skeletal
foundation for a knowledge base. According to this definition,
the same ontology can be used for building several knowledge
bases (KBs), which would share the same skeleton. These
skeletons can be extended by adding concepts and sub-concepts
that cover new areas. Such an ontology gives an easy and clear
understanding of its structure, and inference mechanisms
become easier.
The following table presents a comparison of all CLIR techniques
based on the research fields.
TABLE 1: COMPARISON OF CLIR TECHNIQUES
Columns: Type of CLIR technique; Subtype; Concept; Advantage;
Disadvantage; Application

Type of CLIR technique: DB-CLIR
Subtype: Using document translation
Concept: The document translation approach requires that the entire
documents in the collection are translated into the language of the
user request.
Advantage: It relies on a simple machine-readable bilingual
dictionary to map the bag of words query derived from the user
request to a semantically equivalent bag of words representation in
the document language.
Disadvantage: (1) untranslatable search keys due to the limitations
of general dictionaries, (2) the processing of inflected words,
(3) phrase identification and translation, and (4) lexical ambiguity
in source and target languages.
Application: Dictionary-Based Cross-Language Information Retrieval:
Problems, Methods, and Research Findings [3]

Type of CLIR technique: DB-CLIR
Subtype: Using query translation
Concept: In the query translation approach, the part of speech of a
word in the given context is found and the process of translation and
transliteration is performed.
Advantage: This approach overcomes translation ambiguity and target
polysemy, which are the major drawbacks of DB-CLIR using document
translation.
Disadvantage: Queries with named entities provided better results,
whereas queries without named entities performed very poorly due to
a poor bilingual lexicon.
Application: 1) Hindi to English and Marathi to English CLIR
Evaluation [6]; 2) Query Translation Architecture for Malay-English
CLIR [10]

Type of CLIR technique: Corpora based
Subtype: -
Concept: Each text in the source language is associated with the
texts potentially parallel to it in a CLIR process, and then each
text pair is verified by a pattern based algorithm.
Advantage: The estimation of the translation matrix generated during
the transformation of the query language can be done efficiently by
reducing the document space. The results of these estimations are
promising.
Disadvantage: Corpora based methods suffer from a lack of resources.
Parallel corpora are not always readily available, and those that are
available tend to be relatively small or to cover only a small number
of subjects.
Application: The TREC-2001: Cross Language Information Retrieval
Track [2]

Type of CLIR technique: MT-based
Subtype: Using query translation
Concept: A query translation is conducted with the degraded MT
systems and translated queries of varying quality are obtained. Then
the translated queries are submitted to the IR system and performance
is evaluated.
Advantage: Technologies integrated in MT systems, such as syntactic
and semantic analysis, could help to improve the translation
accuracy.
Disadvantage: An ambiguity problem exists in the MT component, since
the translated query does not necessarily represent the sense of the
original query.
Application: Literature review of Cross Language Information
Retrieval [13]

Type of CLIR technique: Ontology based
Subtype: -
Concept: The ontology based IR system submits the query keywords to
XSL (Extensible Stylesheet Language) to query the ontologies,
extracting related concepts and concept relations.
Advantage: 1) Since the ontology is represented in the form of XML,
concepts, sub-concepts and relationships can be mapped easily.
2) Enables reuse of domain knowledge.
Disadvantage: 1) Increases the creation difficulty. 2) Visualization
problems. The size of the resource (ontology) is inversely
proportional to its specificity.
Application: Chinese-English CLIR based on Domain Ontology
Knowledge [5]; Ontology Approach for Cross-Language Information
Retrieval [12]
V. CONCLUSION
The Internet has opened up opportunities for increasing
multilingual information exchange and retrieval. Cross-lingual
IR provides new paradigms for searching documents across the
many languages of the world, and it can serve as the baseline
for searching not only between two languages but also among
several. Creating accurate metadata in different languages for
documents, or good translations of key information in
documents, can help improve the quality of the index and of
retrieval. When the pure dictionary and the ontology systems
were evaluated, the ontology-based system scored higher
in terms of precision.
In future development, the ontology will be enhanced and
extended by using annotation tools to align new concepts to it,
and then tested again against the dictionary system. Other
areas for investigation include ease of use, the use of relevance
feedback, the effect of more extensive use of concept relations,
and possibly experiments with larger data sets. This paper
discussed different types of CLIR techniques and the advantages
and disadvantages of each technique.
REFERENCES
[1] Mustafa Abusalah, John Tait and Michael Oakes, “Cross
Language Information Retrieval using Multilingual Ontology
as Translation and Query Expansion Base”, September 2009.
[2] F. C. Gey, “The TREC-2001: Cross Language Information
Retrieval Track”, 2001.
[3] Ari Pirkola, Turid Hedlund, Heikki Keskustalo and Kalervo
Järvelin, “Dictionary-Based Cross-Language Information
Retrieval: Problems, Methods, and Research Findings”,
September 2001, Volume 4, pp. 209-230.
[4] Pattabhi R. K. Rao and Sobha L., “Cross Lingual Information
Retrieval Track”, AU-KBC Research Centre, MIT Campus,
Chennai, 2010.
[5] Feng Yu, Dequan Zheng, Tiejun Zhao, Sheng Li and Hao
Yu, “Chinese-English Cross-Lingual Information Retrieval
based on Domain Ontology Knowledge”, 2010.
[6] Manoj Kumar Chinnakotla, Sagar Ranadive, Om P.
Damani and Pushpak Bhattacharyya, “Hindi to English and
Marathi to English Cross Language Information
Retrieval Evaluation”, Department of Computer Science
and Engineering, IIT Bombay, India, 2008.
[7] Sujoy Das, Anurag Seetha, M. Kumar and J. L. Rana,
“Disambiguation Strategies for English-Hindi Cross
Language Information Retrieval System”, 2009.
[8] Dinesh Mavaluru and Dr. R. Shriram, “Telugu English Cross
Language Information Retrieval: A Case Study”, 2013.
[9] Saurabh Varshney and Jyoti Bajpai, “Improving
performance of English-Hindi cross language information
retrieval using transliteration of query terms”, 2013.
[10] Nurjannaton Hidayah Rais, Muhamad Taufik Abdullah and
Rabiah Abdul Kadir, “Query Translation Architecture for
Malay-English Cross-Language Information Retrieval
System”, 2010.
[11] Debasis Mandal, Sandipan Dandapat, Mayank Gupta,
Pratyush Banerjee and Sudeshna Sarkar, “Bengali and Hindi to
English Cross language Text Retrieval under Limited
Resources”, 2008.
[12] S. M. Chaware and Srikanth Rao, “Ontology approach for
cross language Information Retrieval”, 2011.
[13] Mustafa Abusalah, John Tait and Michael Oakes, “Literature
review of Cross language information retrieval”, 2007.

More Related Content

PDF
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
PDF
Cross language information retrieval in indian
PDF
A Review on the Cross and Multilingual Information Retrieval
PDF
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
PDF
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
PDF
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
PDF
Survey on Indian CLIR and MT systems in Marathi Language
PDF
Keywords- Based on Arabic Information Retrieval Using Light Stemmer
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Cross language information retrieval in indian
A Review on the Cross and Multilingual Information Retrieval
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
Survey on Indian CLIR and MT systems in Marathi Language
Keywords- Based on Arabic Information Retrieval Using Light Stemmer

What's hot (20)

PDF
Improving performance of english hindi cross language information retrieval u...
PDF
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
PDF
Design and Development of a Malayalam to English Translator- A Transfer Based...
PDF
Ny3424442448
PDF
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
PDF
A Survey of Arabic Text Classification Models
PDF
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE
PDF
The effect of training set size in authorship attribution: application on sho...
PDF
C1803021622
PDF
A New Concept Extraction Method for Ontology Construction From Arabic Text
PDF
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
PDF
Developing an arabic plagiarism detection corpus
PDF
04. 9990 16097-1-ed (edited arf)
PDF
A New Approach to Parts of Speech Tagging in Malayalam
PDF
Implementation of Urdu Probabilistic Parser
PDF
Myanmar Named Entity Recognition with Hidden Markov Model
PDF
Hybrid approaches for automatic vowelization of arabic texts
PDF
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
DOCX
Multilingualism in Information Retrieval System
PDF
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
Improving performance of english hindi cross language information retrieval u...
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
Design and Development of a Malayalam to English Translator- A Transfer Based...
Ny3424442448
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
A Survey of Arabic Text Classification Models
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE
The effect of training set size in authorship attribution: application on sho...
C1803021622
A New Concept Extraction Method for Ontology Construction From Arabic Text
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
Developing an arabic plagiarism detection corpus
04. 9990 16097-1-ed (edited arf)
A New Approach to Parts of Speech Tagging in Malayalam
Implementation of Urdu Probabilistic Parser
Myanmar Named Entity Recognition with Hidden Markov Model
Hybrid approaches for automatic vowelization of arabic texts
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
Multilingualism in Information Retrieval System
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
Ad

Viewers also liked (20)

PDF
STATIC AND MODAL ANALYSIS OF LEAF SPRING USING FEA
PDF
EFFECT OF PRUNING AND SKIFFING ON GROWTH AND PRODUCTIVITY OF DARJEELING TEA (...
PDF
STUDY OF CARBOHYDRATE METABOLISM IN SEVERE ACUTE MALNUTRITION AND CORRELATION...
PDF
ERROR ESTIMATION IN DEVELOPING GIS MAPS USING DIFFERENT INPUT METHODS OF LAND...
PDF
AN INSIDE LOOK IN THE ELECTRICAL STRUCTURE OF THE BATTERY MANAGEMENT SYSTEM T...
PDF
STATE OF ART IN MODERN RESISTANCE SPOT WELDING
PDF
COMPARATIVE STUDY OF BELL PEPPER ON THE ASPECTS OF THEIR APPROXIMATE ANALYSIS...
PDF
PREVALENCE EVALUATION OF DIABETIC RETINOPATHY
PDF
EFFICIENT EMBEDDED SURVEILLANCE SYSTEM WITH AUTO IMAGE CAPTURING AND EMAIL SE...
PDF
MECHANISMS OF PHOTOPERIOD IN REGULATION OF RICE FLOWERING
PDF
THE FUNDAMENTAL ANALYSIS-THE BEST WAY TO EVALUATE EQUITY SHARES
PDF
SCANNING ELECTRON MICROSCOPE OF DISINFECTANTS EXPOSED PSEUDOMONAS AERUGINOSA ...
PDF
CONVECTIVE HEAT TRANSFER ENHANCEMENTS IN TUBE USING LOUVERED STRIP INSERT
PDF
UNDERSTANDING ENTREPRENEURSHIP: IMPACT OF OVERCONFIDENCE ON ENTREPRENEURSHIP
PDF
IDENTIFICATION OF SUPPLY CHAIN MANAGEMENT PROBLEMS: A REVIEW
PDF
IMPROVEMENT OF SUPPLY CHAIN MANAGEMENT BY MATHEMATICAL PROGRAMMING APPROACH
PDF
EFFECT OF DIFFERENT CONCENTRATIONS OF AUXINS AND COMBINATION WITH KINETIN ON ...
PDF
OXIDATIVE STRESS AND NITRIC OXIDE: A SIGNIFICANT MARKER IN CORONARY ARTERY DI...
STATIC AND MODAL ANALYSIS OF LEAF SPRING USING FEA
EFFECT OF PRUNING AND SKIFFING ON GROWTH AND PRODUCTIVITY OF DARJEELING TEA (...
STUDY OF CARBOHYDRATE METABOLISM IN SEVERE ACUTE MALNUTRITION AND CORRELATION...
ERROR ESTIMATION IN DEVELOPING GIS MAPS USING DIFFERENT INPUT METHODS OF LAND...
AN INSIDE LOOK IN THE ELECTRICAL STRUCTURE OF THE BATTERY MANAGEMENT SYSTEM T...
STATE OF ART IN MODERN RESISTANCE SPOT WELDING
COMPARATIVE STUDY OF BELL PEPPER ON THE ASPECTS OF THEIR APPROXIMATE ANALYSIS...
PREVALENCE EVALUATION OF DIABETIC RETINOPATHY
EFFICIENT EMBEDDED SURVEILLANCE SYSTEM WITH AUTO IMAGE CAPTURING AND EMAIL SE...
MECHANISMS OF PHOTOPERIOD IN REGULATION OF RICE FLOWERING
THE FUNDAMENTAL ANALYSIS-THE BEST WAY TO EVALUATE EQUITY SHARES
SCANNING ELECTRON MICROSCOPE OF DISINFECTANTS EXPOSED PSEUDOMONAS AERUGINOSA ...
CONVECTIVE HEAT TRANSFER ENHANCEMENTS IN TUBE USING LOUVERED STRIP INSERT
UNDERSTANDING ENTREPRENEURSHIP: IMPACT OF OVERCONFIDENCE ON ENTREPRENEURSHIP
IDENTIFICATION OF SUPPLY CHAIN MANAGEMENT PROBLEMS: A REVIEW
IMPROVEMENT OF SUPPLY CHAIN MANAGEMENT BY MATHEMATICAL PROGRAMMING APPROACH
EFFECT OF DIFFERENT CONCENTRATIONS OF AUXINS AND COMBINATION WITH KINETIN ON ...
OXIDATIVE STRESS AND NITRIC OXIDE: A SIGNIFICANT MARKER IN CORONARY ARTERY DI...
Ad




A SURVEY ON VARIOUS CLIR TECHNIQUES

International Journal of Technical Research and Applications, e-ISSN: 2320-8163, www.ijtra.com, Volume 3, Issue 2 (Mar-Apr 2015), PP. 152-156

Miss Rekha Warrier, Mrs Sharvari S. Govilkar
Department of Computer Engineering, Pillai Institute of Information & Technology, Navi Mumbai, India
rekhawarrier827@gmail.com, g_sharvari@rediffmail.com

Abstract- An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing need of Internet users to access information expressed in languages other than their own has established Cross Language Information Retrieval (CLIR) as a major topic in IR.

Index Terms- Cross Language Information Retrieval, Ontology, multilingual, dictionary-based translation, corpora, machine translation

I. INTRODUCTION

The area of Information Access has evolved to cover many sophisticated tasks such as information retrieval, question answering, summarization, multimedia information retrieval, text mining and clustering, and Web information retrieval. Cross-language Information Retrieval (CLIR) can be described at an abstract level as the task of retrieving documents across languages: queries are issued in one language and documents are retrieved in one or more different languages. The variants of IR are:
1) BLIR (Bi-Lingual Information Retrieval)
2) CLIR (Cross-Lingual Information Retrieval)
3) MLIR (Multi-Lingual Information Retrieval)
The ability to search and retrieve information in multiple languages is becoming increasingly important and challenging in today's environment.
Consequently, multilingual and cross-lingual information retrieval (MLIR and CLIR) search engines have received more research attention and are increasingly being used to retrieve information on the Internet. CLIR refers to searching, translating and retrieving information in different languages, but mainly between a source language and a target language.
This paper presents a detailed survey of various CLIR techniques and the advantages and limitations of each method. The classification of CLIR techniques by research field, and their comparison, is discussed in section II. Section III cites the past literature. The types of CLIR techniques based on research fields are discussed in section IV. Finally, section V concludes the paper.

II. CROSS-LINGUAL INFORMATION RETRIEVAL

Cross-language information retrieval (CLIR) is a retrieval process in which the user presents queries in one language to retrieve information in another language. CLIR approaches fall into two research fields: the first is the dictionary-based approach [bilingual MRD and machine translation (MT)], and the second is the concept driven approach.
In dictionary based query translation, the query keywords are translated to the target language using Machine Readable Dictionaries (MRD). MRDs are electronic versions of printed dictionaries, and may be general dictionaries, domain-specific dictionaries, or a combination of both. The major problem in the bilingual dictionary approach is translation ambiguity, in addition to problems with word inflection and with translating word compounds, phrases, proper names, spelling variants and special terms. MT systems normally attempt to determine the correct word sense for translation by using context analysis.
Concept driven approaches such as thesauri and multilingual ontologies bridge the gap between the linguistic term and its meaning.
A bilingual thesaurus groups words with similar meanings in hierarchies (with several levels) of classes and sections and maps them according to their meanings. However, the thesaurus does not include the definitions of words. An ontology is a generalized collection of knowledge that is used to add context to search queries through query expansion, enabling word sense disambiguation. This paper therefore focuses on the CLIR approach that uses an ontology rather than a thesaurus.

III. LITERATURE SURVEY

Mustafa Abusalah et al. [1] report an experiment evaluating a Cross Language Information Retrieval (CLIR) system that uses an ontology to improve query translation in the travel domain. The ontology-based approach significantly outperformed the Machine Readable Dictionary translation baseline, using Mean Average Precision as the metric in a user-centered experiment.
Fredric C. Gey and nine other researchers [2] proposed a method that can potentially acquire all the parallel texts in a web site using a CLIR technique based on parallel corpora.
Ari Pirkola, Turid Hedlund et al. [3] reviewed the literature on dictionary-based cross-language information retrieval and presented the CLIR research done at the University of Tampere (UTA). They present the structured query model and report findings for four different language pairs concerning the effectiveness of query structuring.
Pattabhi R. K. Rao and Sobha L. [4] described how cross-lingual information retrieval can be done effectively between Tamil, a highly agglutinative language, and English, an isolating language. The query needs to be processed with a morphological analyzer or a stemmer to obtain the base forms of the query terms.
Feng YuI, Dequan Zheng et al. [5] presented a method based on domain ontology knowledge to improve the effectiveness of Chinese-English (C-E) CLIR.
In this study, the domain ontology knowledge is acquired from both source-language user queries and target documents, and is used to select target translations and to re-rank the initially retrieved document set.
Manoj Kumar Chinnakotla et al. [6] discussed a query translation approach using bilingual dictionaries. Query words not found in the dictionary are transliterated using a simple rule-based transliteration approach. With this approach, Hindi achieves a Mean Average Precision (MAP) of 0.2366 using the title and a MAP of 0.2952 using title and description. Marathi achieves a MAP of 0.2163 using the title.
Sujoy Das et al. [7] observed that the dictionary based query translation approach has been widely used by researchers of CLIR. Translation ambiguity and target polysemy are the two major problems of dictionary based CLIR. The researchers investigated part-of-speech and co-occurrence based disambiguation techniques for an English-Hindi CLIR system.
Dinesh Mavaluru and Dr. R. Shriram [8] proposed a hybrid model for a Telugu-English CLIR system. A bilingual ontology is used to convert the Telugu word to the English word. The overall system was implemented in Java, and the ontology was built for the Telugu language. The system was tested for accuracy using the Google search interface.
Saurabh Varshney and Jyoti Bajpai [9] studied the effect of target polysemy and translation ambiguity in the dictionary based query translation approach for an English-Hindi CLIR system.
Nurjannaton Hidayah Rais et al. [10] discuss research on query translation in a Malay-English CLIR system. The researchers assumed that improving query translation accuracy improves information retrieval performance.
Debasis Mandal et al. [11] describe experiments on two cross-lingual and one monolingual English text retrieval tasks. A careful analysis revealed that queries with named entities gave better results in all runs, whereas queries without named entities performed very poorly because of poor bilingual lexicons, bringing down the overall performance metrics.
S. M. Chaware and Srikanth Rao [12] discuss an approach to building an ontology from a relational database with some additional rules. The ontology can be built dynamically according to the user's need, giving the user an overall view of the knowledge domain. The result shows a complete, easy and simple way of building an ontology from a database.
Mustafa Abusalah et al. [13] presented a short literature survey of CLIR systems. They described previous work in CLIR and current problems in CLIR, and made recommendations for future work.

IV. TYPES OF CLIR TECHNIQUES

There is an increasing amount of full-text material in various languages available through the Internet and other information suppliers. Cross-language information retrieval (CLIR) has therefore become an important new research area. Some CLIR systems use language resources such as bilingual dictionaries to translate the user's original query, while other systems use machine translation to translate the foreign-language documents beforehand, enabling them to be retrieved by the original query.
Based on the first research field, "Bilingual MRD and MT", CLIR systems are divided into:
1] Query Translation
2] Document Translation
Based on the second research field, the "Concept based field", CLIR systems take two approaches:
1] Multilingual dictionary (Dictionary based)
2] Ontology
CLIR techniques can be classified into different categories based on translation resources:
1] Dictionary-based CLIR technique (DB-CLIR)
2] Corpora based CLIR technique
3] Machine translator based CLIR technique
4] Ontology based CLIR technique (Concept driven field)

1) Dictionary-based CLIR technique
In dictionary based query translation, the query keywords are translated to the target language using Machine Readable Dictionaries (MRD). DB-CLIR is applied using document translation as well as query translation.
i. DB-CLIR using document translation
There are two main strategies in DB-CLIR: i) translating the original documents into the language of the queries, and ii) translating the queries into the language of the documents.
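As a rough illustration of strategy ii), the sketch below translates a keyword query word by word with a toy machine-readable dictionary. The dictionary entries, the English-to-Spanish language pair and the function name `translate_query` are assumptions made only for this example and are not taken from any of the surveyed systems. It keeps every candidate translation of an ambiguous word and passes through words that the dictionary does not cover, which is where the ambiguity and coverage problems of the dictionary approach show up.

```python
# Hypothetical toy English-to-Spanish machine-readable dictionary (MRD).
# The entries are illustrative assumptions, not data from the paper.
MRD = {
    "big": ["grande"],
    "bank": ["banco", "orilla"],  # ambiguous: institution vs. edge of a river
}


def translate_query(query, dictionary):
    """Translate a keyword query word by word.

    Returns (translated, untranslated): `translated` holds one list of
    candidate target-language terms per source word, and `untranslated`
    lists the source words that were missing from the dictionary.
    """
    translated = []
    untranslated = []
    for word in query.lower().split():
        candidates = dictionary.get(word)
        if candidates:
            # Keep all senses; the dictionary alone cannot disambiguate.
            translated.append(list(candidates))
        else:
            # Out-of-dictionary word (for example a proper name): pass through.
            untranslated.append(word)
            translated.append([word])
    return translated, untranslated
```

For the query "big bank Tampere", this sketch keeps both "banco" and "orilla" for "bank" and reports "tampere" as untranslated; a real system would add disambiguation or transliteration at these points.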
The dictionary-based family of approaches includes all techniques that rely on a simple machine-readable bilingual dictionary to map the bag-of-words query derived from the user request to a semantically equivalent bag-of-words representation in the document language. Dictionary based translation is a traditional approach in cross-lingual IR systems, but significant performance degradation is observed when queries contain words or phrases that do not appear in the dictionary.
ii. DB-CLIR using query translation
The document translation approach requires that all documents in the collection are translated into the language of the user request. This may require enormous translation effort and will be expensive. In the query translation approach, the query is translated into the language of the documents and monolingual retrieval is then performed. The query can be translated using a machine translation system, parallel texts and/or domain-specific corpora, or a Machine Readable Dictionary (MRD). The query translation approach is popular in the CLIR community because it is efficient and easily implemented for relatively short queries.

2) Corpora based CLIR technique
A corpus is a repository of natural language material, such as texts, paragraphs and sentences, from one or many languages. Two types of corpora (plural of "corpus") have been used in query translation: parallel corpora and comparable corpora.
i. Parallel corpora
Parallel corpora consist of the same text in more than one language. When retrieving text from a parallel corpus, the query does not need to be translated: a source-language query can be matched against the source-language component of the corpus, and the target-language component aligned to it can then be easily retrieved.
ii. Comparable corpora
Comparable corpora contain text in more than one language.
The texts in each language are not translations of each other, but they cover the same topic area and hence contain an equivalent vocabulary. A number of statistical techniques can be used to derive topic-specific (often technical) bilingual dictionaries from parallel corpora.
The corpora based CLIR technique mainly consists of four modules: preprocessing, candidate texts retrieval, parallel texts verification, and duplicate elimination.
i) First, web pages are input to the preprocessing module; the texts in each page are extracted and saved with the page's URL and each text's relative location in the page.
ii) The candidate texts retrieval module then builds an index for the English texts and uses a cross-lingual information retrieval model to retrieve the English texts most similar to each text in the other language; the retrieval results will contain the wanted parallel texts.
iii) In the next step, the parallel texts verification module selects the real parallel texts from the candidate texts.
Finally, the duplicate elimination module checks for duplicate texts and removes the excess ones. Results returned by this type of cross-lingual information retrieval technique contain a considerable amount of parallel text, with over 90% precision.

3) Machine Translator based CLIR technique
In CLIR, Machine Translation (MT) can be used in two different ways. The first is to use an MT system to translate the foreign-language documents in the corpora into the language of the user's query. This is done off-line beforehand. This approach is not viable for large document collections, or for collections in which the documents are in numerous languages. In the second method, the user's query in the "source" language is translated into the "target" language (the language of the documents in the stored collection). The "target" language query is then used to retrieve "target" language documents using classical IR techniques. With both methods, the MT stage is separate from the retrieval stage.
An ambiguity problem exists in the MT component, since the translated query does not necessarily represent the sense of the original query. For instance, translating the English query "big bank" into another language could produce an inappropriate translation, since it is not clear whether "bank" means the institution or the edge of a river. MT systems normally attempt to determine the correct word sense for translation by using context analysis.
However, a typical search engine query lacks context, as it consists of a small number of keywords. MT is more effective for document translation, where the context is clearer. Compared with dictionary or corpus based methods, the advantage of MT-based CLIR lies in the technologies integrated in MT systems, such as syntactic and semantic analysis, which can help to improve translation accuracy.

4) Ontology based CLIR technique
This type of technique comes under the field of concept driven research. Concept driven approaches such as thesauri (multilingual dictionaries) and multilingual ontologies bridge the gap between the linguistic term and its meaning. An ontology is a hierarchically structured set of terms for describing a domain that can be used as a skeletal foundation for a knowledge base. An ontology, together with a set of instances of the classes or concepts defined, constitutes a knowledge base about the domain being described. The technique is as follows:
i) The ontology based IR system submits the query keywords to XSL (Extensible Stylesheet Language) to query the ontologies, extracting related concepts and concept relations.
ii) The ontology based CLIR system then examines the concepts associated through semantic relations.
iii) If synonyms are found, they are selected for query expansion; all of this is done monolingually. The concepts are then translated into their equivalent concepts in the other language using the ontology's bilingual index.
iv) If a concept is not found in the ontology, the dictionary is used to find the relevant translated concepts.
v) The final translated query terms are combined using Boolean OR and then matched against the corpora documents.
vi) The results are then ranked depending on many factors, such as the number of matching terms found in each document and the number of terms occurring in the document.
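The steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the surveyed systems' implementation: the ontology is reduced to two hypothetical in-memory tables (`ONTOLOGY_SYNONYMS` and `ONTOLOGY_BILINGUAL_INDEX`) instead of XML queried through XSL, the fallback dictionary and the English-to-Spanish entries are invented for the example, and ranking uses only two simple factors: the number of distinct matching terms, then the total count of query-term occurrences in the document.

```python
# Hypothetical toy resources; all names and entries are assumptions.
ONTOLOGY_SYNONYMS = {"flight": ["airfare"], "beach": ["seaside"]}
ONTOLOGY_BILINGUAL_INDEX = {
    "flight": ["vuelo"],
    "airfare": ["pasaje"],
    "beach": ["playa"],
    "seaside": ["costa"],
}
FALLBACK_DICTIONARY = {"cheap": ["barato"]}


def expand_query(keywords):
    """Steps i-iii (monolingual part): add ontology synonyms to the query."""
    expanded = []
    for word in keywords:
        for term in [word] + ONTOLOGY_SYNONYMS.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded


def translate_terms(terms):
    """Steps iii-iv: use the ontology's bilingual index, else the dictionary."""
    translated = []
    for term in terms:
        targets = ONTOLOGY_BILINGUAL_INDEX.get(term) or FALLBACK_DICTIONARY.get(term, [])
        for target in targets:
            if target not in translated:
                translated.append(target)
    return translated


def retrieve(translated_terms, documents):
    """Steps v-vi: Boolean OR matching, then a simple two-factor ranking."""
    term_set = set(translated_terms)
    scored = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        matched = term_set.intersection(tokens)
        if matched:  # Boolean OR: any single matching term qualifies
            occurrences = sum(1 for token in tokens if token in term_set)
            scored.append((doc_id, len(matched), occurrences))
    # More distinct matches first, then more occurrences, then document id.
    scored.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return [doc_id for doc_id, _, _ in scored]


documents = {
    "d1": "vuelo barato a la playa",
    "d2": "la costa y la playa",
    "d3": "receta de cocina",
}
query_terms = translate_terms(expand_query(["cheap", "flight", "beach"]))
ranking = retrieve(query_terms, documents)
```

Here "cheap" is absent from the ontology tables and is translated through the fallback dictionary (step iv), while "flight" and "beach" are first expanded with their synonyms and then translated through the bilingual index.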
In this technique, the ontology is represented in XML so that concepts, sub-concepts and their relationships can be mapped easily. Identifying the most appropriate translations of ontology concepts is crucial in the ontology translation phase. Simple approaches to CLIR have been developed using a multilingual dictionary or WordNet. An ontology is a better choice for CLIR, as it covers the entire context and its relationships, which helps both the user and the system provider. Without an ontology, acquiring knowledge even for a small activity requires accessing the entire database every time, which degrades the results and is time consuming. An ontology avoids this and improves performance: whenever knowledge is required, only the data about the relevant sub-domain needs to be considered, and the ontology gives accurate knowledge.
An ontology is a hierarchically structured set of terms for describing a domain that can be used as a skeletal foundation for a knowledge base. According to this definition, the same ontology can be used for building several knowledge bases (KBs), which would share the same skeleton. These skeletons can be extended by adding concepts and sub-concepts that cover new areas. Such an ontology gives an easy and clear understanding of its structure, and inference mechanisms become easier.
The following table presents a comparison of all CLIR techniques based on the research fields.

TABLE 1: COMPARISON OF CLIR TECHNIQUES

Type of CLIR technique: DB-CLIR
Subtype: Using document translation
Concept: The document translation approach requires that all documents in the collection are translated into the language of the user request.
Advantage: It relies on a simple machine-readable bilingual dictionary to map the bag-of-words query derived from the user request to a semantically equivalent bag-of-words representation in the document language.
Disadvantage: (1) untranslatable search keys due to the limitations of general dictionaries, (2) the processing of inflected words, (3) phrase identification and translation, and (4) lexical ambiguity in source and target languages.
Application: Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings [3]

Type of CLIR technique: DB-CLIR
Subtype: Using query translation
Concept: In the query translation approach, the part of speech of a word in the given context is found, and translation and transliteration are then performed.
Advantage: This approach overcomes translation ambiguity and target polysemy, which are the major drawbacks of DB-CLIR using document translation.
Disadvantage: Queries with named entities gave better results, whereas queries without named entities performed very poorly due to poor bilingual lexicons.
Application: 1) Hindi to English and Marathi to English CLIR Evaluation [6]; 2) Query Translation Architecture for Malay-English CLIR [10]

Type of CLIR technique: Corpora based
Subtype: -
Concept: Each text in the source language is associated with the texts potentially parallel to it in a CLIR process, and each text pair is then verified by a pattern based algorithm.
Advantage: The translation matrix generated during the transformation of the query language can be estimated efficiently by reducing the document space. The results of these estimations are promising.
Disadvantage: Corpora based methods suffer from a lack of resources. Parallel corpora are not always readily available, and those that are available tend to be relatively small or to cover only a small number of subjects.
Application: The TEC-2001: Cross Language Information Retrieval Track [2]

Type of CLIR technique: MT-based
Subtype: Using query translation
Concept: Query translation is conducted with degraded MT systems, and translated queries of varying quality are obtained. The translated queries are then submitted to the IR system and performance is evaluated.
Advantage: Technologies integrated in MT systems, such as syntactic and semantic analysis, can help to improve translation accuracy.
Disadvantage: An ambiguity problem exists in the MT component, since the translated query does not necessarily represent the sense of the original query.
Application: Literature review of Cross Language Information Retrieval [13]

Type of CLIR technique: Ontology based
Subtype: -
Concept: The ontology based IR system submits the query keywords to XSL (Extensible Stylesheet Language) to query the ontologies, extracting related concepts and concept relations.
Advantage: 1) Since the ontology is represented in XML, concepts, sub-concepts and relationships can be mapped easily. 2) Enables reuse of domain knowledge.
Disadvantage: 1) Increases the creation difficulty. 2) Visualization problems. The size of the resource (ontology) is inversely proportional to its specificity.
Application: 1) Chinese-English CLIR based on Domain Ontology Knowledge [5]; 2) Ontology Approach for Cross-Language Information Retrieval [12]

V. CONCLUSION

The Internet has opened opportunities for increasing multilingual information exchange and retrieval. Cross-lingual IR provides new paradigms for searching documents across the many languages of the world, and it can be the baseline for searching not only between two languages but also among several. Creating accurate metadata in different languages, or good translations of the key information in documents, can help improve the quality of the index and of retrieval. In the evaluation of the pure dictionary system and the ontology system, the ontology based system scored higher in terms of precision. In future development, the ontology will be enhanced and extended by using annotation tools to align new concepts to the ontology, and then tested again against the dictionary system. Other areas for investigation include ease of use, the use of relevance feedback, the effect of more extensive use of concept relations, and possibly experiments with larger data sets. This paper discussed different types of CLIR techniques and the advantages and disadvantages of each technique.
REFERENCES
[1] Mustafa Abusalah, John Tait and Michael Oakes, "Cross Language Information Retrieval using Multilingual Ontology as Translation and Query Expansion Base", September 2009.
[2] F. C. Gey, "The TEC-2001: Cross Language Information Retrieval Track", 2001.
[3] Ari Pirkola, Turid Hedlund, Heikki Keskustalo and Kalervo Järvelin, "Dictionary Based Cross Language Information Retrieval: Problems, Methods, and Research Findings", September 2001, Volume 4, pp. 209-230.
[4] Pattabhi R. K. Rao and Sobha L., "Cross Lingual Information Retrieval Track", AU-KBC Research Centre, MIT Campus, Chennai, 2010.
[5] Feng YuI, Dequan Zheng, Tiejun Zhao, Sheng Li and Hao Yu, "Chinese-English Cross-Lingual Information Retrieval based on Domain Ontology Knowledge", 2010.
[6] Manoj Kumar Chinnakotla, Sagar Ranadive, Om P. Damani and Pushpak Bhattacharyya, "Hindi to English and Marathi to English Cross Language Information Retrieval Evaluation", Department of Computer Science and Engineering, IIT Bombay, India, 2008.
[7] Sujoy Das, Anurag Seetha, M. Kumar and J. L. Rana, "Disambiguation Strategies for English-Hindi Cross Language Information Retrieval System", 2009.
[8] Dinesh Mavaluru and Dr. R. Shriram, "Telugu English Cross Language Information Retrieval: A Case Study", 2013.
[9] Saurabh Varshney and Jyoti Bajpai, "Improving performance of English-Hindi cross language information retrieval using transliteration of query terms", 2013.
[10] Nurjannaton Hidayah Rais, Muhamad Taufik Abdullah and Rabiah Abdul Kadir, "Query Translation Architecture for Malay-English Cross-Language Information Retrieval System", 2010.
[11] Debasis Mandal, Sandipan Dandapat, Mayank Gupta, Pratyush Banerjee and Sudeshna Sarkar, "Bengali and Hindi to English Cross language Text Retrieval under Limited Resources", 2008.
[12] S. M. Chaware and Srikanth Rao, "Ontology approach for cross language Information Retrieval", 2011.
[13] Mustafa Abusalah, John Tait and Michael Oakes, "Literature review of Cross language information retrieval", 2007.