Leveraging Semantic and Lexical Matching to Improve the
Recall of Document Retrieval Systems: A Hybrid Approach
PR-285
https://www.flaticon.com/kr/authors/freepik
https://arxiv.org/pdf/2010.01195.pdf
1. Research Background
Information Retrieval (IR)
• Lexical approach (e.g., BM25): ranks documents by the terms they share with the query, using an inverted index
• Retrieval task: 1) retrieval stage, 2) re-ranking stage
https://giyatto.tistory.com/2 https://devopedia.org/information-retrieval
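A minimal sketch of how a BM25-style lexical scorer works, assuming whitespace-tokenized documents. The function name `bm25_scores`, the toy documents, and the `k1`/`b` values are illustrative assumptions, not the settings used in the paper or the Anserini implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term the document lacks contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "storm kills three drivers on icy highway".split(),
    "weather related fatalities fell last year".split(),
    "stock market closes higher".split(),
]
scores = bm25_scores("weather fatalities".split(), docs)
```

Only the second document shares terms with the query, so it is the only one with a non-zero score. The first document is on topic but uses different words, so a purely lexical scorer gives it zero.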
Lexical approach …
• Vocabulary mismatch problem
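With roads closed and highways iced over, at least one driver died in a 17-vehicle pileup in Idaho,
a tour bus crashed on an icy stretch of a Sierra Nevada highway,
and a 100-vehicle accident occurred near Seattle.
Oklahoma and South Carolina each recorded three fatalities.
Arizona, Kentucky, Missouri, Utah, and Virginia each had two.
Those recording a single lightning death for the year were
Washington D.C.; Kansas, Montana, North Dakota,
• The lexical approach cannot retrieve relevant documents that do not contain the query terms.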
Semantic models in the retrieval stage
• Semantic models tend to have lower recall
[CVPR 2013]
• Using neural networks for retrieval had a very high cost
- Each query requires a maximum inner product search (MIPS) between the query embedding and the document embeddings (see PR-272)
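To make the cost concrete, here is a sketch of exact (brute-force) MIPS in plain Python. The function name `mips` and the toy vectors are assumptions for illustration. Every query touches every document embedding, which is why approximate search is used in practice.

```python
def mips(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors with the largest inner product."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    # Sort by inner product only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Hypothetical 3-dimensional embeddings.
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
top = mips([1.0, 0.0, 0.0], doc_vecs, k=2)
```

The work grows linearly with the number of documents and the embedding dimension, so an exact scan over hundreds of thousands of documents per query is expensive.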
Related work
• Semantic approach
Optimized Product Quantization
[CVPR 2013]
• Lexical-semantic hybrid approach
- BERT-based re-ranking models (BERT computes the query-document relevance score)
- QA systems, conversational agents, and product search: lexical approach
- BERT with an inverted index
- Neural-network query expansion: expanding the query
- Latent Semantic Indexing (LSI)
- Neural network based document embedding -> kNN based search
Objective & Approach : a hybrid retrieval approach
• We propose a lexical-semantic hybrid retrieval approach.
• Commercial system hybrid
• end-to-end weakly supervised approach
• lexical-only approach
• Lexical model, semantic model, combination model
2. Methods
The hybrid retrieval approach
• Lexical-based approach: BM25 (Anserini toolkit; https://littlefoxdiary.tistory.com/12)
• Weakly supervised learning, so no hand-labeled training data is needed.
• Approximate kNN search keeps system latency low.
• Built from open-source components.
[Figure: Hybrid system combining BM25 with the NN for semantic retrieval model; result lists of size c]
BERT based semantic retrieval model
NN for semantic retrieval model
• BERT architecture : 6 layers, a hidden size of 256, and 4 attention heads
• Training params : Adam optimizer, learning rate of 5e-4 and a batch size of 32 for 5 million training steps.
• Vocabulary : 7500 words
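The hyperparameters above can be collected in a small configuration object. This is only a sketch: the class name `SemanticModelConfig` and its field names are hypothetical, and the two derived properties are simple arithmetic on the listed values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticModelConfig:
    # Values taken from the slide; field names are illustrative.
    num_layers: int = 6
    hidden_size: int = 256
    num_heads: int = 4
    vocab_size: int = 7500
    learning_rate: float = 5e-4
    batch_size: int = 32
    train_steps: int = 5_000_000

    @property
    def head_dim(self) -> int:
        """Width of each attention head (hidden size split evenly across heads)."""
        return self.hidden_size // self.num_heads

    @property
    def examples_seen(self) -> int:
        """Training examples processed over the whole run."""
        return self.batch_size * self.train_steps

cfg = SemanticModelConfig()
```

With these values each attention head is 64-dimensional, and the run processes 160 million examples, well below the roughly 1B passage-query pairs in the training set.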
Document – Query dataset
[Figure: example queries (“Weather Related Fatalities”, “Information Retrieval”, …, “Sunghoon Joo”) → BM25 → 50 documents, 20 documents, 3 documents → extract 5 sentences]
1. Query ( 5 tri-grams bi-grams)
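2. For each query, retrieve 10 documents with BM25.
3. From each document, extract 5 sentences containing the query terms: Doc-Query pairing
(BERT limits the input length, so passages are used instead of whole documents.)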
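The pairing steps can be sketched as follows. The helper names `ngram_queries` and `best_passage`, the regular-expression tokenizer, and the "window with the most query-term hits" rule are assumptions made for illustration. The paper's exact sampling and passage-selection rules may differ.

```python
import re

def ngram_queries(text, n):
    """All word n-grams of a text, used here as candidate queries."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def best_passage(document, query, size=5):
    """Pick the window of `size` consecutive sentences with the most query-term hits."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    terms = set(query.lower().split())

    def hits(window):
        words = re.findall(r"[a-z0-9]+", " ".join(window).lower())
        return sum(w in terms for w in words)

    # Short documents yield a single window that covers everything.
    windows = [sentences[i:i + size] for i in range(max(1, len(sentences) - size + 1))]
    return " ".join(max(windows, key=hits))

# Hypothetical document and query.
doc = "A storm hit. Weather fatalities rose. Fatalities from weather doubled. Markets fell."
bigrams = ngram_queries("Weather related fatalities", 2)
passage = best_passage(doc, "weather fatalities", size=2)
```

Each (passage, query) pair produced this way becomes one weakly labeled training example for the semantic model.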
“Weather Related Fatalities”: Over the last five years, weather-related fatalities are down 19% from 2015 …
Over the last five years, Sunghoon Joo are down 19% from 2015 …
1
0
Over the last five years, traffic related fatalities are down 19% from 2015 …
1
0
1 0.65
Over the last five years, traffic related projects are down 19% from 2015 …
1 0.55
“Information Retrieval”: Information dataset (IR) is finding material (usually documents) of an unstructured nature …
1 0.6
• A TREC collection (disks 1&2)
• 441,676 news-wire documents
• TREC topics 51-200 are used as queries.
• Training data set : 3.8M bi-gram queries, 1.7M tri-gram queries, and about 1B training
examples (passage-query pairs).
Hybrid Merging – RM3
• RM3 is used to merge the lexical and semantic result lists.
• It is available as open source in the Anserini toolkit.
• Query processing time
[Figure: Hybrid system with relevance model RM3*; result lists of size c]
*Nasreen Abdul-Jaleel et al., UMass at TREC 2004: Novelty and HARD
https://www.cl.cam.ac.uk/teaching/1617/InfoRtrv/lecture7-relevance-feedback.pdf
• RM3 is estimated from the lexical result list of size c and used to reduce the merged 2c documents to c.
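A simplified sketch of relevance-model merging, under stated assumptions: feedback documents are weighted uniformly, scoring uses add-one smoothing, and the names `rm3_model`, `score`, and `hybrid_merge` are hypothetical. This is not Anserini's RM3 implementation. It only illustrates pooling up to 2c candidates and keeping the c that best fit a model built from the lexical list.

```python
import math
from collections import Counter

def rm3_model(query_terms, feedback_docs, lam=0.5, top_terms=10):
    """Interpolate the query model with a feedback (RM1-style) term distribution."""
    rm1 = Counter()
    for doc in feedback_docs:
        for term, n in Counter(doc).items():
            rm1[term] += n / len(doc) / len(feedback_docs)
    top = dict(rm1.most_common(top_terms))
    norm = sum(top.values())
    model = {t: (1 - lam) * p / norm for t, p in top.items()} if norm else {}
    for term, n in Counter(query_terms).items():
        model[term] = model.get(term, 0.0) + lam * n / len(query_terms)
    return model

def score(model, doc, vocab_size, mu=1.0):
    """Negative cross-entropy between the model and a smoothed document model."""
    counts = Counter(doc)
    return sum(p * math.log((counts[t] + mu) / (len(doc) + mu * vocab_size))
               for t, p in model.items())

def hybrid_merge(query_terms, lexical, semantic, docs, c):
    """Pool the two top-c lists (up to 2c docs), re-score, and keep the best c."""
    model = rm3_model(query_terms, [docs[d] for d in lexical[:c]])
    pool = list(dict.fromkeys(lexical[:c] + semantic[:c]))  # de-duplicated union
    vocab_size = len({t for d in docs.values() for t in d})
    pool.sort(key=lambda d: score(model, docs[d], vocab_size), reverse=True)
    return pool[:c]

# Hypothetical tokenized documents and result lists.
docs = {
    "d1": ["storm", "deaths", "rise"],
    "d2": ["deaths", "penalty", "law"],
    "d3": ["storm", "rise", "fatalities"],
    "d4": ["stock", "market", "news"],
}
merged = hybrid_merge(["storm", "deaths"], ["d1", "d2"], ["d3", "d4"], docs, c=2)
```

Because the model is built from the lexical list, documents from the semantic list enter the final c only when they share enough vocabulary with the top lexical results.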
3. Experimental Results
Experiment 1: Hybrid approach vs. lexical approach
• Lexical approach semantic approach (neural model re-ranking stage )
• Hybrid approach Lexical approach c (semantic approach list )
[Figure: Lexical Result List (c) and Semantic Result List (c)]
Results in the lexical result list that are unrelated to the query are replaced with results from the semantic result list.
Experiment 2: Hybrid approach
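• Is the gain simply due to RM3 re-ranking? (a lexical list of 1000 re-ranked down to 500 with RM3)
Semantic result list fixed at 2000 → experiments with varying lexical result list sizes (∈ {500, 1000, 1500, 2000}) → merge (the final number of documents equals the size of the initial lexical-based results)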
• RM3 Merging : Hybrid approach
• Semantic result set Lexical result list merge ,
Experiment 3: Query hybrid approach
• queries hybrid approach (40%) (50%) .
• Hybrid approach query robust
• Query set (Q1) Hybrid approach
.
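Queries are grouped into sets Q1, Q2, Q3, and Q4 according to retrieval performance.
df (document frequency): number of documents containing a term; high for common words (the, an, a …)
idf: inverse document frequency; high for rare terms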
• Query Hybrid approach
(Neural net. )
• Query (high idf) lexical model
, hybrid approach
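The df and idf quantities used in this analysis can be computed as below. This is a sketch with a smoothed idf variant chosen for illustration. The function name `idf` and the toy documents are assumptions, and the paper may use a different formula.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency over a list of token sets."""
    df = sum(term in doc for doc in docs)  # document frequency
    return math.log((len(docs) + 1) / (df + 1))

# Hypothetical collection, one token set per document.
docs = [
    set("the storm closed the highway".split()),
    set("the market fell".split()),
    set("the fatalities rose".split()),
]
common = idf("the", docs)
rare = idf("fatalities", docs)
```

A word that appears in every document gets an idf of zero, while a word that appears in only one document gets a clearly positive value.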
Experiment 4: lexical results vs. semantic results
• Hybrid approach topic coverage .
• Lexical approach Semantic approach .
[Figure: Tf-idf vectorization → t-SNE visualization; panels (a), (b), (c)]
Select 50 queries → 5 documents per query (from each of lexical and semantic) → build term lists → compute the Jaccard index
• The lexical and semantic approaches retrieve documents with little term overlap (low Jaccard index).
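For reference, the Jaccard index between two term lists is the size of their intersection divided by the size of their union. A minimal sketch, with the function name `jaccard` and the example term lists chosen only for illustration:

```python
def jaccard(terms_a, terms_b):
    """Jaccard index of two term collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty lists
    return len(a & b) / len(a | b)

# Hypothetical term lists from a lexical and a semantic result.
lexical_terms = ["weather", "fatalities", "storm"]
semantic_terms = ["storm", "deaths", "accident"]
overlap = jaccard(lexical_terms, semantic_terms)
```

Here one term out of five distinct terms is shared, giving 0.2. Values near 0 mean the two result sets use largely different vocabulary.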
• Lexical approach (BM25).
• Semantic approach .
Select 50 queries → 5 documents per query (from each of lexical and semantic) → length analysis over the 250 documents from each approach
4. Conclusion
Thank you.
• Retrieval stage Lexical approach Semantic
approach .
• Hybrid approach ,
Hybrid approach Lexical, Semantic approach
.
• Future works:
1) Hybrid approach
2)
3) QA system, recommendation, conversational agents information
retrieval task
