Cross-Language Information Retrieval

University of Arizona

Sumin Byeon


Overview
Query in Korean: 안드로이드 이메일 암호화 ("Android email encryption")
-> Matching algorithm, backed by a bilingual corpus database
-> Translated query: Android email encryption
-> Google Search
-> Results in English


Background
• Corpus - a collection of written text; an entry may be a single word, multiple words, a phrase, or a sentence

• Comparable corpus - a collection of text from a pair of languages referring to the same domain [1]; stored as (source text, target text) pairs

• N-gram - an n-character or n-word slice of a longer string [2]. We use the term n-gram for n-character slices, and we use 4-grams (four-grams or quad-grams)

• Source language - the language of the original phrases

• Target language - the language into which CLIR translates the original phrases

[1] Picchi, Eugenio, and Carol Peters. "Cross-Language Information Retrieval: A System for Comparable Corpus Querying." Vol. 2. Springer US, 1998. ISSN 1387-5264.
[2] Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." 1994.
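
As a small illustration of the n-gram definition above, the following Python sketch slices a string into overlapping n-character slices together with their character offsets. The function name `char_ngrams` is hypothetical and does not come from the slides.

```python
def char_ngrams(text: str, n: int = 4) -> list[tuple[int, str]]:
    """Return (offset, slice) for every n-character slice of text.

    Slices overlap: consecutive slices share n - 1 characters.
    Text shorter than n characters yields an empty list.
    """
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


grams = char_ngrams("global variable")
for offset, gram in grams:
    # Show spaces as underscores so they stay visible
    print(offset, gram.replace(" ", "_"))
```

With n = 4, the slice at offset 3 of "global variable" is "bal_" and the slice at offset 8 is "aria".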


Motivation
• Users want to acquire information even when it is not sufficiently available in their native language

• A survey has shown that people have a higher foreign-language proficiency level in reading than in writing

• CLIR may bridge the gap between the desire to obtain information and the unavailability or under-availability of such information in the user's native language


Goals
• Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language

• Present relevant search results in the target language, that is, the language in which the largest amount of information is available


Components
• Domain-specific bilingual corpus extraction from multiple sources

• Corpus indexing

• Querying and string matching


Corpus Indexing
(S, T) -> (i1, h1), (i2, h2), …, (in, hn)

• Quad-grams (k=4)

• Fingerprint overlapping is okay, although it is not the most space-efficient way

Example entries, shown as offset: quad-gram (hash), with _ marking a space:

Java / 자바
    0: Java (20451)

global variable / 전역 변수
    3: bal_ (14870)
    8: aria (14269)

example / 예제
    1: xamp (20451)

[Figure: frequency chart; vertical axis "Frequency" from 0 to 50000, horizontal axis ticks from 1 to 103]
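
The mapping (S, T) -> (i1, h1), …, (in, hn) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the slides do not say which hash function produced values such as 20451, so a 16-bit truncation of CRC-32 is used as a stand-in and the hashes will differ from those on the slide. The names `fingerprint` and `build_index` are hypothetical.

```python
import zlib

K = 4  # quad-grams, as on the slide


def fingerprint(text, k=K):
    """Return [(i, h), ...]: offset i and hash h of each k-character slice."""
    return [
        # Stand-in hash (assumption): low 16 bits of CRC-32 of the slice
        (i, zlib.crc32(text[i:i + k].encode("utf-8")) & 0xFFFF)
        for i in range(len(text) - k + 1)
    ]


def build_index(pairs):
    """Map each hash to the (pair_id, offset) positions where it occurs.

    Only the first text of each pair is fingerprinted; the second text is
    the translation returned when the pair matches.
    """
    index = {}
    for pair_id, (text, _translation) in enumerate(pairs):
        for offset, h in fingerprint(text):
            index.setdefault(h, []).append((pair_id, offset))
    return index


pairs = [("Java", "자바"), ("global variable", "전역 변수"), ("example", "예제")]
index = build_index(pairs)
```

Every overlapping slice is kept here, which matches the remark that overlapping fingerprints are acceptable even though they cost extra space.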


Querying & Matching
Query: Java global variable example

Query fingerprints, shown as offset: quad-gram (hash), with _ marking a space:
    0: Java (20451)
    1: ava_ (24085)
    …
    8: bal_ (14870)
    …
    13: aria (14269)
    …
    22: xamp (20451)

Matching index entries:
    Java / 자바
        0: Java (20451)
    global variable / 전역 변수
        3: bal_ (14870)
        8: aria (14269)
    example / 예제
        1: xamp (20451)
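
One way to read the diagram above is that a query fingerprint and an index fingerprint with the same hash vote for where an entry starts in the query (query offset minus entry offset). The Python sketch below follows that reading; it is an interpretation, not the author's confirmed algorithm. The hash is a stand-in (16-bit truncated CRC-32), and `fingerprint` and `match` are hypothetical names.

```python
import zlib
from collections import defaultdict

K = 4  # quad-grams


def fingerprint(text, k=K):
    """Return [(offset, hash), ...] for each k-character slice (stand-in hash)."""
    return [(i, zlib.crc32(text[i:i + k].encode("utf-8")) & 0xFFFF)
            for i in range(len(text) - k + 1)]


def match(query, pairs):
    """Return sorted (start, text, translation) for entries found in query."""
    index = defaultdict(list)
    for pair_id, (text, _translation) in enumerate(pairs):
        for offset, h in fingerprint(text):
            index[h].append((pair_id, offset))

    # votes[(pair_id, start)]: fingerprints agreeing that the entry
    # begins at character `start` of the query
    votes = defaultdict(int)
    for q_offset, h in fingerprint(query):
        for pair_id, offset in index.get(h, []):
            votes[(pair_id, q_offset - offset)] += 1

    hits = []
    for (pair_id, start), count in votes.items():
        text, translation = pairs[pair_id]
        complete = count == len(text) - K + 1
        # Compare the actual text to rule out hash collisions
        if start >= 0 and complete and query[start:start + len(text)] == text:
            hits.append((start, text, translation))
    return sorted(hits)


pairs = [("Java", "자바"), ("global variable", "전역 변수"), ("example", "예제")]
hits = match("Java global variable example", pairs)
```

For the slide's query, "bal_" sits at offset 8 in the query and offset 3 in the entry, so it votes for a start of 5, the same start that "aria" (13 minus 8) votes for.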


Multiple Candidates
• Longest match first

• Confidence: how many times does this comparable corpus pair appear in a set of documents?

• Outcome of matching depends on the domain of the documents stored in the database

Candidates for "global variable", shown as offset: quad-gram (hash), with _ marking a space:

global variable / 전역 변수 ("global variable")
    3: bal_ (14870)
    8: aria (14269)

global / 세계적인 ("worldwide")
    0: loba (25848)

variable / 변수 ("variable", the noun)
    1: aria (14269)

variable / 가변적인 ("changeable")
    1: aria (14269)
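
The two selection rules above (longest match first, then confidence) can be sketched in Python as a single sort key. The confidence counts below are invented purely for illustration and do not come from the slides; `Candidate` and `pick` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    matched: str       # text matched in the query
    translation: str   # paired text in the other language
    confidence: int    # documents containing this pair (illustrative values)


def pick(candidates):
    """Prefer the longest matched text; break ties by higher confidence."""
    return max(candidates, key=lambda c: (len(c.matched), c.confidence))


candidates = [
    Candidate("global variable", "전역 변수", 12),  # counts are made up
    Candidate("global", "세계적인", 30),
    Candidate("variable", "변수", 25),
    Candidate("variable", "가변적인", 4),
]

best = pick(candidates)
best_single_word = pick([c for c in candidates if c.matched == "variable"])
```

Because length is compared before confidence, the two-word entry wins even though a single-word entry has a higher count. Between the two "variable" candidates, the count decides, which is where the domain of the stored documents matters.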

Indexing and Querying Recap

Query: 자바 전역 변수 예제

Candidate translations:
    자바: Java
    전역: transfer
    전역: all parts (of)
    전역 변수: global variable
    변수: variable
    예제: example

Translated query: Java global variable example
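
The recap above can be approximated with a greedy, longest-match-first pass over the query. The Python sketch below is a simplification: it works on whitespace-separated words rather than on quad-gram fingerprints, always takes the first listed translation, and passes unknown words through unchanged, all of which are assumptions. The name `translate` is hypothetical.

```python
def translate(query, lexicon):
    """Translate a query by always consuming the longest known phrase."""
    tokens = query.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest remaining phrase first, then shorter ones
        for length in range(len(tokens) - i, 0, -1):
            phrase = " ".join(tokens[i:i + length])
            if phrase in lexicon:
                out.append(lexicon[phrase][0])
                i += length
                break
        else:
            # No entry starts here: keep the word untranslated (assumption)
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Entries taken from the slide
lexicon = {
    "자바": ["Java"],
    "전역": ["transfer", "all parts (of)"],
    "전역 변수": ["global variable"],
    "변수": ["variable"],
    "예제": ["example"],
}

result = translate("자바 전역 변수 예제", lexicon)
print(result)
```

Trying the longer phrase 전역 변수 before 전역 alone is what keeps the output from becoming "Java transfer variable example".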


Relationship with Content Addressability

Query: 자바 전역 변수 예제
    자바 -> Java
    전역 변수 -> global variable
    예제 -> example

Sample document containing the translated terms:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Quisque id Java tristique nunc. Vestibulum sit amet tortor
ullamcorper, pretium augue ac, facilisis quam. Ut convallis
suscipit mauris, at porta erat vulputate in. Nulla vitae
consectetur risus. global variable Aenean justo risus, mollis
sed condimentum sed, sagittis eget nisl. Phasellus sem leo,
commodo at dignissim vitae, ullamcorper nec metus. Proin
pretium porta lectus nec example pulvinar. Nulla non
elementum nisi, vel hendrerit quam. Curabitur bibendum
lobortis tincidunt. Proin vel velit porta, tempus ligula a,
interdum leo. Aenean lorem nibh, facilisis ut porta sit amet,
ornare quis ligula.


Evaluation
• Matching
    • Did it translate all the search terms to the target language properly?
    • Did it preserve domain-specific information?

• Searching
    • Hit ratio: # of relevant web pages / # of results on the first page
    • Total number of search results
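
The hit ratio above is a simple fraction. As a hedged illustration, the Python sketch below computes it from manual relevance judgments of the first page of results; the function name `hit_ratio` and the list-of-booleans input are assumptions, not part of the original evaluation procedure.

```python
def hit_ratio(judgments: list[bool]) -> float:
    """Fraction of first-page results that were judged relevant."""
    if not judgments:
        raise ValueError("no results to judge")
    return sum(judgments) / len(judgments)


# Example: 6 of the 10 first-page results judged relevant
first_page = [True] * 6 + [False] * 4
ratio = hit_ratio(first_page)
```

The denominator is the number of results actually shown on the first page, so a query that returns only six results is scored out of six.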

Evaluation
• 재귀 열거 집합 - recursively enumerable sets
    • (3/3, 1/1)

• 배낭 문제 시간 복잡도 - 배낭 issue the time complexity
    • (3/4, 1/2)

• 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
    • (7/7, 4/4)

Evaluation
• Query in source language “재귀 열거 집합”
    • (6/10, 15,300)

• Query in target language “recursively enumerable sets”
    • (10/10, 105,000)

• Google Translate result “Set of recursive enumeration”
    • (10/10, 1,990,000)

Evaluation
• Query in source language “배낭 문제 시간 복잡도”
    • (10/10, 31,200)

• Query in target language “배낭 issue time complexity”
    • (2/6, 2,270)

• Google Translate result “Knapsack problem, the time complexity”
    • (10/10, 206,000)

Evaluation
• Query in source language “가상화를 통한 데이터센터 에너지 효율 극대화”
    • (5/10, 36,100)

• Query in target language “through virtualization datacenter energy efficiency maximization”
    • (8/10, 264,000)

• Google Translate result “Maximize energy efficiency through data center virtualization”
    • (10/10, 284,000)

Conclusion & Future Work
• Preliminary results look satisfactory

• Machine-translation-based CLIR appears to be more useful in many cases

• Evaluation factors may not reflect the actual quality of the system

• The evaluation process is labor-intensive; an automated evaluation is needed

• Fuzzy matching based on lexical information (e.g., call, calls)

• Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)