Cross-Language Information Retrieval
University of Arizona

Sumin Byeon

1
Overview
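[Diagram: the Korean query "안드로이드 이메일 암호화" goes through the matching algorithm, which consults the bilingual corpus database and yields "Android email encryption"; that phrase is submitted to Google Search, which returns results in English.]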
2
Background
• Corpus - a collection of written text; an entry may be a single word, multiple words, or even a phrase or sentence
• Comparable corpus - a collection of text from pairs of languages referring to the same domain[1]; stored here as (source text, target text) pairs
• N-gram - an n-character or n-word slice of a longer string[2]. Here, "n-gram" refers to n-character slices; we use 4-grams (four-grams, or quad-grams)
• Source language - the language of the original phrases
• Target language - the language into which CLIR translates the original phrases

[1]: Picchi, Eugenio, and Carol Peters. Cross-Language Information Retrieval: A System for Comparable Corpus Querying. Vol. 2. Springer US, 1998. 1387-5264.
[2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." 1994.
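
The two kinds of n-gram in the definition above can be illustrated with a short Python sketch. The function names `char_ngrams` and `word_ngrams` are hypothetical and chosen only for this example; they are not taken from the presented system.

```python
def char_ngrams(text, n=4):
    """Return every n-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return every n-word slice of text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Character 4-grams (quad-grams), the kind used in these slides.
print(char_ngrams("example"))  # ['exam', 'xamp', 'ampl', 'mple']

# Word 2-grams, shown only for contrast.
print(word_ngrams("Java global variable example"))
```

A string shorter than n simply yields an empty list in this sketch.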

3
Motivation
• People want to acquire information even when it is not sufficiently available in their native language
• A survey has shown that people have a higher foreign-language proficiency level in reading than in writing
• CLIR may bridge the gap between the desire to obtain information and the unavailability or under-availability of that information in the user's native language

4
Goals
• Allow users to query for domain-specific information (here, computer science and software engineering) in their native language
• Present relevant search results in the target language, the language in which the largest amount of information is available

5
Components
• Domain-specific bilingual corpus extraction from multiple sources
• Corpus indexing
• Querying and string matching

6
Corpus Extraction

7
Corpus Indexing
(S, T) -> (i1, h1), (i2, h2), …, (in, hn)
• Quad-grams (k=4)
• Fingerprint overlapping is okay, although it is not the most space-efficient way

Example entries, with each fingerprint written as index: quad-gram (hash) and "_" marking a space:
Java / 자바
  0: Java (20451)
global variable / 전역 변수
  3: bal_ (14870)
  8: aria (14269)
example / 예제
  1: xamp (20451)

[Chart: vertical axis "Frequency", 0 to 50,000; horizontal axis 0 to 103]
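
The mapping from a corpus pair (S, T) to a list of (index, hash) fingerprints can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not name the hash function, so CRC-32 folded to 16 bits stands in for it (its values will not match the numbers on the slide), and every overlapping quad-gram is stored, whereas the slide lists only a few fingerprints per entry. The names `fingerprints` and `build_index` are hypothetical.

```python
import zlib

K = 4  # quad-grams, as on the slide


def fingerprints(text, k=K):
    """Return (index, hash) pairs for every k-character slice of text.

    Spaces are written as '_' to match the slide's notation. The hash is an
    assumed stand-in (CRC-32 folded to 16 bits), not the author's function.
    """
    normalized = text.replace(" ", "_")
    return [
        (i, zlib.crc32(normalized[i:i + k].encode("utf-8")) & 0xFFFF)
        for i in range(len(normalized) - k + 1)
    ]


def build_index(pairs, k=K):
    """Map each hash to the (source, target, index) entries that produced it."""
    index = {}
    for source, target in pairs:
        for i, h in fingerprints(source, k):
            index.setdefault(h, []).append((source, target, i))
    return index


corpus = [("Java", "자바"), ("global variable", "전역 변수"), ("example", "예제")]
index = build_index(corpus)

# "global variable" has 15 characters, so it yields 12 overlapping quad-grams.
print(len(fingerprints("global variable")))  # 12
```

Storing every overlapping quad-gram is the simplest choice and matches the remark that overlapping is acceptable but not the most space-efficient option.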

8
Querying & Matching
Query: Java global variable example

Query fingerprints:
0: Java (20451)
1: ava_ (24085)
…
8: bal_ (14870)
…
13: aria (14269)
…
22: xamp (20451)

Matched corpus entries:
Java / 자바
  0: Java (20451)
global variable / 전역 변수
  3: bal_ (14870)
  8: aria (14269)
example / 예제
  1: xamp (20451)
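
One way to picture the matching step is the Python sketch below: fingerprints of the query are looked up in an index built from the corpus, and each hash hit is confirmed against the text itself, since equal hashes alone do not prove a match. The hash (CRC-32 folded to 16 bits) and the names `fingerprints` and `find_matches` are assumptions made for this example; the slides do not specify them.

```python
import zlib

K = 4  # quad-gram length


def fingerprints(text, k=K):
    """(index, hash) for each k-character slice; CRC-32 is an assumed stand-in hash."""
    s = text.replace(" ", "_")
    return [(i, zlib.crc32(s[i:i + k].encode("utf-8")) & 0xFFFF)
            for i in range(len(s) - k + 1)]


def find_matches(query, corpus):
    """Return sorted (start, source, target) for corpus sources found in query."""
    index = {}
    for source, target in corpus:
        for i, h in fingerprints(source):
            index.setdefault(h, []).append((source, target, i))

    q = query.replace(" ", "_")
    matches = set()
    for pos, h in fingerprints(query):
        for source, target, i in index.get(h, []):
            start = pos - i  # where the source would begin inside the query
            s = source.replace(" ", "_")
            # Equal hashes only suggest a match; confirm on the text itself.
            if start >= 0 and q[start:start + len(s)] == s:
                matches.add((start, source, target))
    return sorted(matches)


corpus = [("Java", "자바"), ("global variable", "전역 변수"), ("example", "예제")]
for match in find_matches("Java global variable example", corpus):
    print(match)
# (0, 'Java', '자바')
# (5, 'global variable', '전역 변수')
# (21, 'example', '예제')
```

The offset arithmetic mirrors the slide: "bal_" sits at index 3 of the corpus entry and index 8 of the query, so the entry starts at position 5 of the query.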

9
Multiple Candidates
• Longest match first
• Confidence: how many times does this comparable corpus pair appear in a set of documents?
• Outcome of matching depends on the domain of the documents stored in the database

Candidates for "global variable":
global variable / 전역 변수
  3: bal_ (14870)
  8: aria (14269)
global / 세계적인
  0: loba (25848)
variable / 변수
  1: aria (14269)
variable / 가변적인
  1: aria (14269)
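
The two selection rules, longest match first and confidence, could be combined as in the following Python sketch. The confidence counts are invented purely for illustration (the slides give no counts), and `rank` and `best_per_source` are hypothetical helper names.

```python
# (source, target, confidence); the confidence counts are invented for illustration.
candidates = [
    ("global variable", "전역 변수", 12),
    ("global", "세계적인", 30),
    ("variable", "변수", 25),
    ("variable", "가변적인", 4),
]


def rank(cands):
    """Order candidates by longest source text (in words), then by highest confidence."""
    return sorted(cands, key=lambda c: (-len(c[0].split()), -c[2]))


def best_per_source(cands):
    """Keep the highest-confidence target for each distinct source text."""
    best = {}
    for source, target, confidence in cands:
        if source not in best or confidence > best[source][1]:
            best[source] = (target, confidence)
    return best


print(rank(candidates)[0])                      # ('global variable', '전역 변수', 12)
print(best_per_source(candidates)["variable"])  # ('변수', 25)
```

With these made-up counts, the two-word entry wins even though "global" alone has a higher count, which is the point of trying the longest match first.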
10
Indexing and Querying Recap

Query: 자바 전역 변수 예제

자바: Java
전역: transfer
전역: all parts (of)
전역 변수: global variable
변수: variable
예제: example

Result: Java global variable example
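
The recap can be reproduced with a greedy, longest-match-first lookup over whitespace-separated tokens, sketched below in Python. The dictionary entries come from the slide; the `translate` function, its tokenization by spaces, and the choice to keep unmatched tokens untranslated are assumptions made for this example.

```python
# Entries taken from the recap slide; a source term may have several targets.
dictionary = {
    "자바": ["Java"],
    "전역": ["transfer", "all parts (of)"],
    "전역 변수": ["global variable"],
    "변수": ["variable"],
    "예제": ["example"],
}


def translate(query, dictionary):
    """Greedy longest-match-first translation over whitespace-separated tokens."""
    tokens = query.split()
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at token i first, then shrink it.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in dictionary:
                # Take the first listed target; ranking candidates is out of scope here.
                output.append(dictionary[phrase][0])
                i = j
                break
        else:
            # No entry covers this token: keep it untranslated.
            output.append(tokens[i])
            i += 1
    return " ".join(output)


print(translate("자바 전역 변수 예제", dictionary))  # Java global variable example
```

Because "전역 변수" is tried before "전역", the query yields "global variable" rather than "transfer variable".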

11
Relationship with Content Addressability

Query: 자바 전역 변수 예제
자바: Java
전역 변수: global variable
예제: example

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Quisque id Java tristique nunc. Vestibulum sit amet tortor
ullamcorper, pretium augue ac, facilisis quam. Ut convallis
suscipit mauris, at porta erat vulputate in. Nulla vitae
consectetur risus. global variable Aenean justo risus, mollis
sed condimentum sed, sagittis eget nisl. Phasellus sem leo,
commodo at dignissim vitae, ullamcorper nec metus. Proin
pretium porta lectus nec example pulvinar. Nulla non
elementum nisi, vel hendrerit quam. Curabitur bibendum
lobortis tincidunt. Proin vel velit porta, tempus ligula a,
interdum leo. Aenean lorem nibh, facilisis ut porta sit amet,
ornare quis ligula.

12
Evaluation
• Matching
  • Did it translate all the search terms to the target language properly?
  • Did it preserve domain-specific information?
• Searching
  • Hit ratio: # of relevant web pages / # of results on the first page
  • Total number of search results
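
As a small worked example of the hit ratio defined above, the Python sketch below divides the number of relevant pages by the number of results on the first page. The function name `hit_ratio` is hypothetical, and the input values are arbitrary sample counts.

```python
from fractions import Fraction


def hit_ratio(relevant, shown):
    """Relevant pages divided by results shown on the first page."""
    if shown <= 0:
        raise ValueError("the first page must list at least one result")
    return Fraction(relevant, shown)


ratio = hit_ratio(6, 10)
print(ratio, float(ratio))  # 3/5 0.6
```

Using `Fraction` keeps the ratio exact; note that it reduces 6/10 to 3/5, so the raw counts should be stored separately if they need to be reported as-is.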
13
Evaluation
• 재귀 열거 집합 - recursively enumerable sets
  • (3/3, 1/1)
• 배낭 문제 시간 복잡도 - 배낭 issue the time complexity
  • (3/4, 1/2)
• 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
  • (7/7, 4/4)
14
Evaluation
• Query in source language “재귀 열거 집합”
  • (6/10, 15,300)
• Query in target language “recursively enumerable sets”
  • (10/10, 105,000)
• Google Translate result “Set of recursive enumeration”
  • (10/10, 1,990,000)
15
Evaluation
• Query in source language “배낭 문제 시간 복잡도”
  • (10/10, 31,200)
• Query in target language “배낭 issue time complexity”
  • (2/6, 2,270)
• Google Translate result “Knapsack problem, the time complexity”
  • (10/10, 206,000)
16
Evaluation
• Query in source language “가상화를 통한 데이터센터 에너지 효율 극대화”
  • (5/10, 36,100)
• Query in target language “through virtualization datacenter energy efficiency maximization”
  • (8/10, 264,000)
• Google Translate result “Maximize energy efficiency through data center virtualization”
  • (10/10, 284,000)
17
Conclusion & Future Work
• Preliminary results look satisfactory
• Machine-translation-based CLIR appears to be more useful in many cases
• Evaluation factors may not reflect the actual quality of the system
• Labor-intensive evaluation process - need for automated evaluation
• Fuzzy matching based on lexical information (e.g., call, calls)
• Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)
18
