Latest trends in AI and
Information Retrieval
- Abhay Ratnaparkhi
Outline
• Introduction
• Overview of how search engines work
• Crawling, Indexing, Querying, Ranking
• Open-source solutions and products
• Real world problems
• Extracting text from HTML
• Ranking documents – Learning to Rank
• Formulating better query – Relevance Feedback
• Featured Snippet - Automated Question Answer Generation
• Federated Search
• Finding near-duplicates in a large set of documents
• Neural Information Retrieval – Trends
• Local vs distributed representations
• Query document matching
• Query Expansion
• Working in software industry
• Job roles
• Software Development processes
• Skills you need
What is information retrieval?
• Finding material of an unstructured nature that satisfies
an information need from within large collections.
• Search Engines
• Question Answering systems
• Recommendation systems
Expert Systems - IBM Watson DeepQA
https://www.aaai.org/Magazine/Watson/watson.php
The IBM Watson DeepQA system outperformed human champions in the
Jeopardy! Challenge (2011)
Search is an integral part of such QA systems
Virtual Assistant - Amazon Alexa
Alexa, what’s India’s current score?
Alexa, play a Marathi song.
Search is required to answer questions related to
most of the skills
Search Engines
How Does Search Work?
Given a query `q`, find the matching set of documents `d`.
Categories of search systems: Open Source, Web Search, Insight Engines (IBM Watson Discovery)
Web Crawler
• Discovers web pages by recursively visiting linked pages, starting from
a set of seed URLs.
• Crawling at scale – needs a distributed system
• Apache Nutch, StormCrawler, Scrapy, Sparkler
• Storing crawled content
• Server-side rendering vs client-side rendering
• Googlebot uses headless Chrome to render pages.
• Google Puppeteer
• Link analysis – finding page importance – PageRank
• Collecting features such as page speed, mobile friendliness, content quality etc.
• Deep Web – the portion of the web not accessible to crawlers (~90%)
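The crawl loop and link analysis above can be sketched as follows. This is a minimal illustration, not how any of the listed crawlers are implemented: `TOY_WEB` is a hypothetical in-memory link graph standing in for real HTTP fetches, and the damping factor of 0.85 is a commonly used value assumed here.

```python
from collections import deque

# Hypothetical in-memory "web": page -> outgoing links (stands in for HTTP fetches).
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],  # nothing links here, so a crawl from a.example misses it
}

def crawl(seeds, fetch_links):
    """Breadth-first crawl: visit the seeds, then every page reachable via links."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order

def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over a link graph given as {page: [outlinks]}."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            # Pages without outlinks spread their rank over all pages.
            targets = outlinks if outlinks else list(graph)
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

visited = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
subgraph = {url: [l for l in TOY_WEB[url] if l in visited] for url in visited}
ranks = pagerank(subgraph)
```

A real crawler would add politeness delays, robots.txt handling, URL normalization and a distributed frontier; the sketch only shows the traversal and the ranking idea.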
Inverted Index
• Ranking functions
• Term Frequency (tf) × Inverse Document Frequency (idf)
• Okapi BM25
• Details about the Lucene inverted index
Source: https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html#1533
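To make the inverted index and BM25 concrete, here is a small sketch. The toy corpus is invented for illustration, tokenization is a plain whitespace split, and `k1=1.2`, `b=0.75` are commonly used defaults assumed here rather than values from the slides. The idf form is one common smoothed variant.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: doc_id -> text.
DOCS = {
    1: "the crawler fetches web pages",
    2: "the index maps terms to pages",
    3: "ranking orders pages for a query",
}

def build_index(docs):
    """Return (inverted index, document lengths); index maps term -> {doc_id: tf}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()
        lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            index[term][doc_id] = tf
    return index, lengths

def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Score only documents found in the postings lists of the query terms."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: long documents need more occurrences to score as high.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

index, lengths = build_index(DOCS)
results = bm25_search("index terms", index, lengths)
```

The key point is that the query never touches documents that do not appear in a postings list, which is what makes lookup fast on large collections.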
Real World Problems
Extracting clean text from a web page
• Remove unnecessary information like
headers, footers, advertisements etc.
• Boilerplate content degrades search
precision
• CLEANEVAL - a competitive evaluation on
the topic of cleaning arbitrary web pages
• Boilerplate detection using shallow text features – 2010
• http://www.l3s.de/~kohlschuetter/boilerplate/WSDM2010-Kohlschuetter-slides.pdf
• Web2Text: Deep Structured Boilerplate
Removal
Source - https://arxiv.org/abs/1801.02607
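A rough sketch of the shallow-text-feature idea: classify each text block using only cheap features such as its word count and link density. The `Block` type, the pre-segmented input and both thresholds are assumptions made for this illustration; they are not the rules or values from the cited work.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    linked_words: int  # number of words that sit inside <a> tags (hypothetical input)

def link_density(block):
    """Fraction of the block's words that are link text."""
    words = len(block.text.split())
    return block.linked_words / words if words else 1.0

def is_content(block, min_words=10, max_link_density=0.3):
    """Keep long blocks with few links; thresholds are illustrative guesses."""
    words = len(block.text.split())
    return words >= min_words and link_density(block) <= max_link_density

page = [
    Block("Home Products About Contact", linked_words=4),  # navigation bar
    Block("Search engines remove headers and footers before indexing "
          "so that only the main article text is matched", linked_words=0),
    Block("Copyright Example Inc", linked_words=0),  # footer
]
clean_text = [block.text for block in page if is_content(block)]
```

Navigation bars tend to be short and almost entirely link text, which is why these two features alone already separate many boilerplate blocks from article text.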
Learning to Rank – How to measure
relevancy?
• Human annotators - many annotators manually
assign relevancy labels to the documents
• Automated ways - observe click patterns
and other metrics on the Search Engine Results
Page (SERP), e.g. with click models
• Relevancy metrics
• Precision: the fraction of
returned results that are relevant
• Recall: the fraction of
relevant results that are
returned
• nDCG: Normalized
Discounted Cumulative Gain -
this metric assumes that
highly relevant documents are
more useful than moderately
relevant documents, which are
in turn more useful than
irrelevant documents, and that
a relevant document is more useful
the higher it is ranked.
• E.g. if documents are given
labels from 0 to 5, the ranking
{5, 5, 4, 3, 0} has a high nDCG.
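The three metrics above can be computed as in the sketch below. The exponential gain `2**label - 1` with a `log2` position discount is one common formulation of DCG, assumed here; the document IDs are made up.

```python
import math

def precision(retrieved, relevant):
    """Fraction of retrieved results that are relevant."""
    retrieved = list(retrieved)
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant results that were retrieved."""
    hits = sum(1 for doc in set(retrieved) if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def dcg(labels):
    """Discounted cumulative gain; position 1 gets the smallest discount."""
    return sum((2 ** label - 1) / math.log2(position + 1)
               for position, label in enumerate(labels, start=1))

def ndcg(labels):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0

best = ndcg([5, 5, 4, 3, 0])   # already in ideal order
worst = ndcg([0, 3, 4, 5, 5])  # same labels, reversed
```

The slide's example ranking is already sorted by label, so its nDCG is the maximum possible; reversing the same labels lowers the score because the best documents are pushed to heavily discounted positions.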
Reranking using - Learning to Rank
• Ranking model
• The model is trained using labels
• Aim is to Maximize nDCG
• Pointwise, pairwise and listwise approaches
• https://www.cl.cam.ac.uk/teaching/1516/R222/l2r-overview.pdf
• RankNet, LambdaRank, LambdaMART
Query: ibm products

Document              | Label | Orig score | BM25 - title | PageRank | #Visits
www.ibm.com           | 4     | 2.3        | 2.0          | 3        | 200K
www.ibm.com/products  | 5     | 2.4        | 3.0          | 2        | 10K
www.microsoft.com     | 2     | 2.1        | 1.1          | 3        | 300K
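As an illustration of the pairwise approach, the sketch below trains a linear scorer on the three labelled rows from the table using a RankNet-style pairwise logistic loss. It is a simplification under stated assumptions: RankNet itself uses a neural network, the log-scaling of #Visits is a preprocessing choice made here, and the learning rate and epoch count are arbitrary.

```python
import math

# (document, label, [orig score, BM25 title, PageRank, log10(#visits)])
TRAINING = [
    ("www.ibm.com", 4, [2.3, 2.0, 3.0, math.log10(200_000)]),
    ("www.ibm.com/products", 5, [2.4, 3.0, 2.0, math.log10(10_000)]),
    ("www.microsoft.com", 2, [2.1, 1.1, 3.0, math.log10(300_000)]),
]

def score(weights, features):
    """Linear ranking model: weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(rows, learning_rate=0.1, epochs=200):
    """Minimize log(1 + exp(-(s_better - s_worse))) over all ordered pairs."""
    weights = [0.0] * len(rows[0][2])
    pairs = [(hi[2], lo[2]) for hi in rows for lo in rows if hi[1] > lo[1]]
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for better, worse in pairs:
            diff = [a - b for a, b in zip(better, worse)]
            margin = score(weights, diff)
            # Derivative of the pairwise loss with respect to the margin.
            coeff = -1.0 / (1.0 + math.exp(margin))
            for i, d in enumerate(diff):
                gradient[i] += coeff * d
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

weights = train_pairwise(TRAINING)
reranked = sorted(TRAINING, key=lambda row: score(weights, row[2]), reverse=True)
ranking = [row[0] for row in reranked]
```

The model never sees absolute labels, only which document of each pair should rank higher. That is the defining difference from pointwise training.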
Relevance Feedback and Query Expansion
Relevance Feedback (local analysis)
Pseudo Relevance Feedback – an automated way to reformulate the query
by assuming that the top retrieved documents are relevant
Query Expansion (Global analysis)
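Pseudo relevance feedback can be sketched as: run the query, assume the top-k results are relevant, and add their most frequent new terms to the query. The stopword list, the sample documents and the parameter names below are all hypothetical, and simple term counting is used here in place of a weighted scheme.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is", "for"}  # illustrative only

def expand_query(query, ranked_docs, top_k=2, extra_terms=1):
    """Append the most frequent non-query terms from the top_k documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in ranked_docs[:top_k]:  # top results are assumed relevant
        for term in text.lower().split():
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    expansion = [term for term, _ in counts.most_common(extra_terms)]
    return query_terms + expansion

# Hypothetical first-pass results, best match first.
first_pass = [
    "jaguar car speed review",
    "jaguar car engine car",
    "jaguar cat habitat",
]
expanded = expand_query("jaguar", first_pass)
```

Because the top results are only assumed to be relevant, a poor first pass can pull the expanded query further off topic, which is the usual risk of this technique.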
Featured Snippets & Automated QA Generation
• Natural Language Generation
• Stanford Question Answering Dataset (SQuAD)
https://www.coursera.org/specializations/natural-language-processing#courses
• Transfer learning – Use the model with little retraining
in other domains.
• Transformer based models – BERT, GPT-3, LaMDA
Federated/Aggregated Search
• Resource selection (or query
intent prediction).
• Result aggregation
• If w1, w2, w3, w4, w5 are the
web results, we can constrain
each vertical result block to end
up in one of the slots s1, s2, s3,
which are distributed among the
web results in the following way:
s1, w1, s2, w2, w3, w4, w5, s3.
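The slot layout above can be expressed as a small function. This sketch covers only the final page assembly and assumes that resource selection has already decided which vertical block, if any, goes into each slot. The function and argument names are made up for illustration.

```python
def aggregate(web_results, slot_blocks):
    """Build the page as s1, w1, s2, w2..wn, s3, skipping empty slots.

    slot_blocks maps a slot name ("s1", "s2", "s3") to a vertical result block.
    """
    page = []
    if "s1" in slot_blocks:
        page.append(slot_blocks["s1"])  # above the first web result
    page.extend(web_results[:1])
    if "s2" in slot_blocks:
        page.append(slot_blocks["s2"])  # directly after the first web result
    page.extend(web_results[1:])
    if "s3" in slot_blocks:
        page.append(slot_blocks["s3"])  # below the last web result
    return page

web = ["w1", "w2", "w3", "w4", "w5"]
full_page = aggregate(web, {"s1": "s1", "s2": "s2", "s3": "s3"})
news_only = aggregate(web, {"s2": "news block"})
```

Fixing the slots in advance turns result aggregation into a small assignment problem (which vertical goes into which slot) instead of a free interleaving of every result.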
Finding near duplicate documents
• Document similarity
• Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
• Set b = new Set(["chair", "rug", "keyboard"]);
• Jaccard Coefficient = |a ∩ b| / (|a| + |b| - |a ∩ b|) = 3 / (8 - 3) = 0.6, or 60%
• MinHash (Locality Sensitive Hashing)
• Reduces each large set to a short signature of hash values, so that
similarity can be estimated cheaply by comparing signatures
• Mining Massive Datasets
• http://www.mmds.org/#book
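A sketch of both ideas, using the same two sets as the slide. The exact Jaccard coefficient is computed directly; the MinHash part simulates a family of hash functions by salting SHA-1 with a seed, which is an implementation choice made for this example, as is the signature length of 128.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard coefficient: |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

def _hash(seed, item):
    """One member of a seeded hash family (SHA-1 salted with the seed)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=128):
    """For each hash function keep only the minimum hash value over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimate_similarity(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the Jaccard coefficient."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = {"chair", "desk", "rug", "keyboard", "mouse"}
b = {"chair", "rug", "keyboard"}
exact = jaccard(a, b)
estimate = estimate_similarity(minhash_signature(a), minhash_signature(b))
```

The estimate works because, for a random hash function, the probability that two sets share the same minimum equals their Jaccard coefficient. Longer signatures reduce the estimation error at the cost of more storage per document.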
Neural Information Retrieval
• Neural IR is the application of shallow or deep neural networks to IR tasks.
• Other natural language processing capabilities such as machine translation and named entity linking are
not neural IR but could be used in an IR system.
Neural IR models can be categorized based on whether they influence the query representation,
the document representation, the relevance estimation, or a combination of these steps.
Source – Neural IR
Word Embeddings
Learn an embedding that maps words to vectors.
This requires a function W(word) that returns a vector encoding that word.
Relationships between words correspond to differences
between vectors.
Word2vec, GloVe
“A word is characterized by the company it keeps”
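The claim that relationships correspond to vector differences can be illustrated with hand-made vectors. The 2-dimensional embeddings below are invented for the example (real Word2vec or GloVe vectors are learned from text and have hundreds of dimensions); only the function name `W` comes from the slide.

```python
import math

# Hypothetical 2-d embeddings: [royalty, gender]. Not learned from data.
EMBEDDINGS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def W(word):
    """Return the vector encoding the given word."""
    return EMBEDDINGS[word]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Find the word closest to W(a) - W(b) + W(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(W(a), W(b), W(c))]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(W(word), target))

answer = analogy("king", "man", "woman")
```

Here the difference `W("king") - W("man")` isolates the "royalty" direction, and adding `W("woman")` lands on the vector for "queen". Learned embeddings show the same pattern only approximately.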
Vector Search Engines
• Weaviate
• Milvus
• Approximate Nearest Neighbors
Search
Working in software industry
Job Roles
• Software Developer
• Full Stack Developer
• Machine Learning Engineer
• Data Scientist
• Site Reliability Engineer
• Software Architect
• Front End Developer
• UX Designer
• Iteration Manager
• Scrum Master
• Product Owner
• Research Staff Member
• People Manager
Agile Software development
