Combining Evidence
for Cross-language Information Retrieval
Petra Galuscakova
petra@umd.edu
University of Maryland, College Park
03/31/2020
Cross-language Information Retrieval (CLIR)
What are the possible side
effects of Plaquenil?
Plaquenil
side effects
… troubles de
l'accommodation, maladie de
la rétine, régressant à l'arrêt
du traitement....
… accommodation
disorders, retinal disease,
regressing when treatment
is stopped...
- Swahili
- Tagalog
- Somali
- Lithuanian
- Bulgarian
- Pashto
Machine Translation for English Retrieval of
Information in Any Language (MATERIAL)
Demo
Outline
1. How to build a CLIR system?
2. How to combine multiple systems?
3. How to pick out systems for the combination?
1. How to build a CLIR system?
CLIR Pipeline
Recording
Speech Recognition
Speech processing
Translation
Plaquenil
side effects
Plaquenil
Effets secondaires
… troubles de
l'accommodation, maladie de
la rétine, régressant à l'arrêt
du traitement....
… accommodation
disorders, retinal disease,
regressing when treatment
is stopped...
Query Translation
Document Translation
Translation Approaches
● Neural Machine Translation v1 trained using Marian
● Neural Machine Translation v2 trained without punctuation and casing using Marian
● Neural Machine Translation v3 trained using Sockeye
● Statistical Machine Translation trained using Moses
● Probabilistic Structured Queries (PSQ) which uses trained translation probabilities
● 1-best and N-best versions of the systems
● Trained using different data
P (das Haus | house) = 0.5
P (der Haushalt | house) = 0.3
P (das Parlament | house) = 0.2
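The translation probabilities above can be used to weight matches in the document language. The following is a minimal sketch of that idea, not the MATERIAL implementation: it assumes the probabilities are P(f | e) for a query term e, drops the German articles so that each translation is a single token, and uses a hypothetical function name and toy document.

```python
from collections import Counter

# P(f | e) values from the slide; articles are dropped so each German
# translation is a single token (an assumption made for this sketch).
TRANSLATION_PROBS = {
    "house": {"haus": 0.5, "haushalt": 0.3, "parlament": 0.2},
}


def expected_term_frequency(query_term, doc_tokens, probs=TRANSLATION_PROBS):
    """Probability-weighted count of all translations of query_term in a document."""
    counts = Counter(doc_tokens)
    translations = probs.get(query_term, {})
    return sum(p * counts[f] for f, p in translations.items())


# Hypothetical German document, already lowercased and tokenized.
doc = "das parlament tagt und das haus ist voll das haus".split()
weighted_tf = expected_term_frequency("house", doc)
print(weighted_tf)
```

Every translation alternative contributes in proportion to its probability, so a document is not penalized when it uses a less likely translation of the query term.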
Collection Processing
● Character normalization
○ Tokenization (e.g. aren’t → aren ’ t)
○ Lowercasing (e.g. New York → new york)
○ Punctuation removal (e.g. ex–prime minister → ex prime minister)
○ Diacritics removal (e.g. Galuščáková → Galuscakova)
○ Non-alphabetic character removal (e.g. abc@gmail.com → abc gmail com)
● Stemming (e.g. feeding cats → feed cat)
○ Applied during retrieval
○ Applied before MT training
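The character normalization steps listed above can be sketched with the Python standard library. This is an illustrative sketch only: the function name is hypothetical, stemming is left out, and the final filter keeps only the letters a-z, so it suits Latin-script text and would not work as written for languages such as Bulgarian or Pashto.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics, and split on non-alphabetic characters."""
    # Decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (diacritics removal).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Replace punctuation, digits, and symbols with spaces, then tokenize.
    return re.sub(r"[^a-z]+", " ", lowered).split()


for sample in ["New York", "ex–prime minister", "Galuščáková", "abc@gmail.com"]:
    print(sample, "->", " ".join(normalize(sample)))
```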
side effects ≠ “side effects”
Matcher
● Indri Language Model:
● Hidden Markov Model:
● Probabilistic Term Occurrence:
572 experiments available
on the Development set
2. How to combine multiple systems?
Normalization and System Combination
[Diagram: Query X, Query Y, Query Z → IR Systems 1 to 4 → Score Normalization → System Combination → Document List X, Document List Y, Document List Z]
Systems Combination Approaches
S1        S2        S3
A 0.8     C -0.9    A 45
B 0.75    D -0.5    D 42
C 0.32    -         E 30
...       ...
2
W = 35    W = 22    W = 15
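A minimal sketch of how per-system result lists like these might be merged, assuming a CombMNZ-style rule: sum each document's (optionally weighted) scores and multiply by the number of systems that returned it. The exact weighting used in the talk is not given on the slides, so the weighted variant here is an assumption. The scores are hypothetical values already normalized to [0, 1]; only the document IDs and the weights 35, 22, 15 come from the slide.

```python
from collections import defaultdict


def comb_mnz(runs, weights=None):
    """Combine runs (dicts of doc -> normalized score) CombMNZ-style."""
    weights = weights or [1.0] * len(runs)
    totals = defaultdict(float)  # weighted sum of scores per document
    hits = defaultdict(int)      # number of systems returning the document
    for run, weight in zip(runs, weights):
        for doc, score in run.items():
            totals[doc] += weight * score
            hits[doc] += 1
    return {doc: totals[doc] * hits[doc] for doc in totals}


# Hypothetical normalized scores for the three systems.
s1 = {"A": 0.8, "B": 0.75, "C": 0.32}
s2 = {"C": 0.6, "D": 0.4}
s3 = {"A": 0.5, "D": 0.3, "E": 0.2}

combined = comb_mnz([s1, s2, s3])
ranking = sorted(combined, key=combined.get, reverse=True)
weighted = comb_mnz([s1, s2, s3], weights=[35, 22, 15])
print(ranking)
```

Documents returned by several systems are boosted twice, once through the summed score and once through the hit count, which is why raw scores on different scales need to be normalized before combination.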
Score Normalization
● Ranks
○ Borda Counts: Score = N - rank + 1 (where N is the number of returned documents)
○ Reciprocal Ranks: Score = 1 / rank
● Scores
○ Sum-To-One
○ Min-Max
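The normalization options can be sketched as small functions. The two rank-based formulas follow the slide. The slides give no formulas for Min-Max and Sum-To-One, so the versions below are common definitions used here as assumptions: Sum-To-One shifts the minimum score to zero before dividing by the sum, and the handling of constant score lists is arbitrary. Function names are hypothetical and inputs are assumed non-empty.

```python
def borda_counts(ranked_docs):
    """Score = N - rank + 1, where N is the number of returned documents."""
    n = len(ranked_docs)
    return {doc: n - rank + 1 for rank, doc in enumerate(ranked_docs, start=1)}


def reciprocal_ranks(ranked_docs):
    """Score = 1 / rank."""
    return {doc: 1.0 / rank for rank, doc in enumerate(ranked_docs, start=1)}


def min_max(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # constant scores: map everything to 1.0 (assumption)
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def sum_to_one(scores):
    """Shift the minimum to 0, then scale so the scores sum to 1 (assumed variant)."""
    lo = min(scores.values())
    shifted = {doc: s - lo for doc, s in scores.items()}
    total = sum(shifted.values())
    if total == 0:  # constant scores: spread the mass evenly (assumption)
        return {doc: 1.0 / len(scores) for doc in scores}
    return {doc: s / total for doc, s in shifted.items()}


s3 = {"A": 45, "D": 42, "E": 30}
print(borda_counts(["A", "D", "E"]), min_max(s3), sum_to_one(s3))
```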
Combination/Normalization
MAP scores for LT [Q2/EVAL, text only], 10,203 documents, 1,000 queries
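For reference, the MAP figures are means over queries of average precision. A minimal sketch of the standard definition follows; the function names and the toy run are illustrative and are not taken from the evaluation tooling used in the talk.

```python
def average_precision(ranked_docs, relevant):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: query -> ranked doc list; qrels: query -> set of relevant docs."""
    aps = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example: one query with two relevant documents.
runs = {"q1": ["A", "B", "C"]}
qrels = {"q1": {"A", "C"}}
print(mean_average_precision(runs, qrels))
```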
System Combination
(WeightCombMNZ + Sum-To-One Norm)
October 2019 evaluation [Q2/EVAL] 10,203 documents, 1,000 queries
3. How to pick out systems for the combination?
System Selection
● How to pick a system combination that will work well on the evaluation set?
● How to do so efficiently?
○ Thousands of possible setups
○ ~500 systems available on the Development set
○ ~10^20 ways to select 10 systems for combination
○ Need to reduce number of required experiments
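The order of magnitude can be checked directly. This short sketch assumes exactly 500 candidate systems, a round figure standing in for the "~500" above.

```python
import math

# Number of unordered ways to choose 10 systems out of 500.
n_ways = math.comb(500, 10)
print(f"{n_ways:.2e}")
```

Evaluating each of these combinations on the Development set is out of reach, which motivates the pre-selection and clustering strategies that follow.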
Manual System Selection
572 systems of LT on Development set
Pre-selection of 15 systems
All 6-way combinations
System Selection
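Once 15 systems are pre-selected, trying every 6-way combination becomes feasible. A minimal sketch of that exhaustive step follows; the system names and the scoring function are hypothetical stand-ins, and in practice the score would be a Development-set metric such as MAP for the combined run.

```python
from itertools import combinations


def best_combination(systems, k, evaluate):
    """Return the k-system combination with the highest evaluation score."""
    return max(combinations(systems, k), key=evaluate)


# Hypothetical names for the 15 pre-selected systems.
preselected = [f"sys{i:02d}" for i in range(15)]
n_candidates = sum(1 for _ in combinations(preselected, 6))


def toy_score(combo):
    """Toy stand-in for a Development-set metric: prefers low system indices."""
    return -sum(int(name[3:]) for name in combo)


best = best_combination(preselected, 6, toy_score)
print(n_candidates, best)
```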
Human Clusters System Selection
572 systems of LT on Development set
Pre-selection of 15 systems
All 6-way combinations
System Selection
Manual
Best
Clustering
572 systems of LT on Development set
● Correctly retrieved Query/Document relevance pairs
● Jaccard coefficient
● Spectral clustering
https://towardsdatascience.com/unsupervised-machine-learning-spectral-clustering-algorithm-implemented-from-scratch-in-python-205c87271045
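The similarity underlying the clustering can be sketched as follows: each system is represented by the set of relevant query/document pairs it retrieved correctly, and two systems are compared with the Jaccard coefficient. The system names and pairs below are made up for illustration. The resulting matrix could then be passed as a precomputed affinity to a spectral clustering implementation, for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`; that step is omitted here to keep the sketch dependency-free.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def affinity_matrix(system_hits):
    """Pairwise Jaccard similarities between systems, in sorted name order."""
    names = sorted(system_hits)
    matrix = [[jaccard(system_hits[x], system_hits[y]) for y in names] for x in names]
    return names, matrix


# Hypothetical correctly retrieved (query, document) pairs per system.
hits = {
    "nmt_v1": {("q1", "d1"), ("q1", "d2"), ("q2", "d5")},
    "nmt_v2": {("q1", "d1"), ("q1", "d2"), ("q2", "d6")},
    "psq": {("q3", "d9")},
}
names, matrix = affinity_matrix(hits)
for name, row in zip(names, matrix):
    print(name, row)
```

Systems that succeed on the same pairs end up in the same cluster, so picking one system per cluster favors combinations whose members make different mistakes.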
Systems Selection
MAP scores for LT [Q2/EVAL, text]
10,203 documents, 1,000 queries
Random Selection
MAP scores for LT [Q2/EVAL, text], WeightCombMNZ with STO normalization, 10,203 documents, 1,000 queries
Takeaways
● System combination achieves a 40% improvement over the single best system
● Selecting systems for a combination is crucial for effectiveness and performance
● Human-aided and manual methods work best
○ But the fully automatic clustering-based method is robust
Other Research Directions
https://www.dropbox.com/s/vkxl9awdep80sq5/shamus.mp4?dl=0
Visual Features
● Similar setting:
○ Feature Signatures
● Concept detection and visual object similarity
● Same faces
Video Retrieval Results
C - Context
M - Metadata
V - Visual Features
MediaEval 2014 Search and Hyperlinking
4,000 hours of BBC broadcast
MAP-tol
