CONCEPTS AND CHALLENGES
OF TEXT RETRIEVAL
FOR SEARCH ENGINES
PRE-CONFERENCE TUTORIAL
by Gan Keng Hoon
16th August 2016
1
THIS TUTORIAL
Overview: Text Retrieval & Search Engine
Concept: Basics of Text Retrieval
Challenges: Semantics & Specific
Case: Expert Search Engine
2
Search
3
What Do People Search for?
Fu Yuanhui
How to get free Pokeball?
How to write thesis in three months?
keynote speaker ICAICTA 2016
4
What Do People Expect?
How to get free Pokeball
5
Behind the
Click?
6
Quiz: Which one is not a Search Engine?
7
Types of Search Engine
Web Search Engine
Google, Yahoo, Bing
Domain Specific Search Engine
Medline/Pubmed
Microsoft Academic
Desktop Search Engine
Copernic
8
Connecting Two Ends
Search Collection
• Web
• Domain Specific
• Personal
• Enterprise
• Etc.
Information Needs
“I want to know more about the keynote speeches of ICAICTA 2016.”
“I need more Pokeballs, free of charge...”
“What’s so funny about Fu Yuanhui??”
“Scholarship ending soon, three months left to submit my thesis...”
• Web Sites
• Journal Articles
• News
• Images
• Videos
• Audio
• Scanned Documents
• Tweets
• Posts
• Reviews
• Etc.
9
A Conceptual Model for Text Retrieval
[Diagram labels: Information Needs, Query, Search Collection, Document Representation, Retrieved Documents, Indexing, Formulation, Retrieval Function, Relevance Feedback, Natural Language Content Analysis]
10
Natural Language Content Analysis
11
Search Collection (Retrieval Unit)
Web pages, email, books, news stories, scholarly
papers, text messages, Word™, PowerPoint™, PDF,
forum postings, patents, etc.
A retrieval unit can be
Part of a document, e.g. a paragraph, a slide, a page etc.
In different structures, e.g. HTML, XML, plain text etc.
Of different sizes/lengths.
12
Document Representation
Full Text Representation
Keep everything. Complete.
Requires huge resources. Too much may not be good.
Reduced (partial) Content Representation
Remove unimportant content, e.g. stopwords.
Standardize to reduce overlapping content, e.g. stemming.
Retain only important content, e.g. noun phrases, headers etc.
13
Document Representation
Think of representation as a way of storing the document.
Bag of Words Model
Store the document as the bag (multiset) of its words,
disregarding grammar and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
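The two count vectors above can be reproduced with a short sketch. It assumes lowercasing and whitespace splitting as the only text processing, and the function name `bag_of_words` is made up for this illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (word list, count vectors) for a list of document strings.

    Assumption: tokens are lowercased and split on whitespace only.
    Words are listed in order of first appearance.
    """
    vocab = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab.append(word)

    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Word order inside each document is discarded: only the counts per word survive in the vectors.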
14
Information Needs & Query
Information Needs != Query
Recall the information needs
Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speeches of
ICAICTA 2016
Query: free pokeball
Information Need: I need more Pokeballs. I don’t want to pay. No cheat
codes.
15
Retrieved Documents
From the original collection, a subset of documents is obtained.
What determines which documents to return?
Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and
the query based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the
query.
4. The output is a ranked list displayed to the user; the top documents
are judged more relevant by the system.
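The four steps can be sketched in a few lines. This is only an illustration: "similarity" is assumed here to be the number of distinct query terms a document contains, and the function names and the three sample documents are chosen for the example.

```python
def tokenize(text):
    """Assumption: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, docs):
    """Rank documents by the number of distinct terms shared with the query."""
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in docs.items():
        # Steps 1 and 2: compare terms and compute a similarity score.
        overlap = len(query_terms & set(tokenize(text)))
        if overlap > 0:
            scored.append((doc_id, overlap))
    # Step 3: sort by decreasing similarity (ties broken by document id).
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    # Step 4: the ranked list is what the user sees.
    return scored


docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
print(rank("straits times", docs))
```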
16
Indexing
Convert documents into a
representation or data structure that
improves the efficiency of retrieval.
Generate a set of useful terms
called indexes.
Why?
A wide variety of words is used in texts,
but not all are important.
Among the important words, some
are more contextually relevant.
Some basic processes
involved
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File
17
Indexing (Tokenization)
Convert a sequence of characters
into a sequence of tokens with
some basic meaning.
“The cat chases the mouse.”
the | cat | chases | the | mouse
“Bigcorp's 2007 bi-annual report
showed profits rose 10%.”
bigcorp | 2007 | bi | annual | report | showed | profits | rose | 10%
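A minimal tokenizer that reproduces the two token lists above might look as follows. The rules it applies (lowercasing, dropping a possessive 's, splitting on any other punctuation, keeping a percent sign attached to a number) are assumptions inferred from the example output, not rules stated in the tutorial.

```python
import re


def tokenize(text):
    """Split text into lowercase tokens (illustrative rules only)."""
    text = text.lower()
    # Drop possessive 's so that "bigcorp's" becomes "bigcorp".
    text = re.sub(r"['’]s\b", "", text)
    # A token is a number followed by '%', or a run of letters and digits.
    return re.findall(r"\d+%|[a-z0-9]+", text)


print(tokenize("The cat chases the mouse."))
print(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
```

Real collections need more careful rules for apostrophes, numbers, and periods than this sketch provides.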
18
Indexing (Tokenization)
A token can be a single term or multiple terms.
“Samsung Galaxy S7 Edge, redefines what a phone can do.”
samsung galaxy s7 edge | redefines | what | a | phone | can | do
or
samsung | galaxy | s7 | edge | redefines | what | a ….
19
Indexing (Tokenization)
Common Issues
1. Capitalized words can have different meaning from lower case words
Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be a part of a word, a part of a possessive, or just a
mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's
degree, england's ten largest cities, shriner's
20
Indexing (Tokenization)
3. Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the
beat, 288358
4. Periods can occur in numbers, abbreviations, URLs, ends of
sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for
documents
21
Indexing (Stopping)
Top 50 Words from AP89 News Collection
Recall: indexes should be useful terms
that link to a document.
Are the terms in the figure on the right
useful?
22
Indexing (Stopping)
Stopword list can be created from high-frequency words or based
on a standard list
Lists are customized for applications, domains, and even parts of
documents
e.g., “click” is a good stopword for anchor text
Best policy is to index all words in documents, make decisions
about which words to use at query time?
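The "index everything, decide at query time" policy can be illustrated with a small sketch. The stopword list below is a tiny hypothetical one, and the fallback for queries made up entirely of stopwords is a design choice for this example.

```python
# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "on", "how"}


def query_terms(query, stopwords=STOPWORDS):
    """Drop stopwords from a query; the index itself keeps every word."""
    terms = query.lower().split()
    kept = [term for term in terms if term not in stopwords]
    # If the query consists only of stopwords, keep it unchanged
    # rather than searching with an empty query.
    return kept or terms


print(query_terms("How to get free Pokeball"))
print(query_terms("to the"))
```

Because the index still contains all words, a query that is nothing but stopwords can still be answered.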
23
Indexing (Stemming)
Many morphological variations of words
inflectional (plurals, tenses)
derivational (making verbs into nouns etc.)
In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words
to a common stem
usually involves removing suffixes
Can be done at indexing time or as part of query processing (like
stopwords)
24
Indexing (Stemming)
Porter Stemmer
Algorithmic stemmer used in
IR experiments since the 70s
Consists of a series of rules
designed to remove the longest
possible suffix at each step
Produces stems, not words
Example Step 1 (right figure)
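The figure with Step 1 is not part of this transcript. As a rough illustration, the sketch below applies the four Step 1a rules from Porter's original 1980 description (sses → ss, ies → i, ss → ss, s → removed), which may differ in detail from the rules shown on the slide. The function name `porter_step_1a` is made up for this example.

```python
def porter_step_1a(word):
    """Apply Step 1a of the original Porter stemmer to a lowercase word.

    The rules are tried in order, so the longest matching suffix wins.
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The output "poni" shows why the stemmer is said to produce stems rather than words.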
25
Indexing (Phrases)
Recall from tokenization: meaningful tokens, e.g. phrases,
make better indexes.
Text processing issue – how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in
queries
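Of the three approaches, word n-grams are the simplest to sketch. The helper name `word_ngrams` is made up for this example, and the input is assumed to be already tokenized.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["samsung", "galaxy", "s7", "edge"]
print(word_ngrams(tokens, 2))
print(word_ngrams(tokens, 3))
```

Every n-gram becomes a candidate phrase, so this approach needs no tagger but produces many index terms that are not meaningful phrases.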
26
Indexing (Phrases)
Example Noun Phrases
* Other methods, e.g. N-grams
27
Indexing (Inverted Index)
Recall, indexes are designed to support search.
Each index term is associated with an inverted list
Contains lists of documents, or lists of word occurrences in documents, and
other information.
Each entry is called a posting.
The part of the posting that refers to a specific document or location
is called a pointer
Each document in the collection is given a unique number
Lists are usually document-ordered (sorted by document number)
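A small sketch of these ideas: each term maps to a document-ordered list of postings, where a posting is assumed here to be a (document number, count) pair. The three short documents and the function name `build_index` are chosen only for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> list of (doc number, count) postings."""
    index = defaultdict(list)
    # Visiting documents in increasing number keeps every list document-ordered.
    for doc_id in sorted(docs):
        counts = {}
        for term in docs[doc_id].lower().split():
            counts[term] = counts.get(term, 0) + 1
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)


# Each document in the collection is given a unique number.
docs = {1: "new straits times", 2: "new straits daily", 3: "north borneo times"}
index = build_index(docs)
print(index["new"])
print(index["times"])
```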
28
Indexing (Inverted Index)
Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish
29
Indexing (Inverted Index)
Simple inverted index.
30
Indexing (Inverted Index)
Inverted index with counts.
Supports better ranking algorithms.
31
Indexing (Inverted Index)
Inverted index with positions.
Supports proximity matching.
32
Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score
computed using the document representation, the query, and a
ranking algorithm
33
Retrieval Function (Vector Space Model)
Rank-based method.
Documents and query are represented by vectors of term weights.
The collection is represented by a matrix of term weights.
34
Retrieval Function (Vector Space Model)
      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1
D1: new straits times
D2: new straits daily
D3: north borneo times
Vector of useful terms
35
Retrieval Function (Vector Space Model)
      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176
idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176
then multiply by tf
tf.idf weight
Term frequency (tf) weight measures
importance in the document (in the table, the count of the term in the document).
Inverse document frequency (idf) measures
importance in the collection: idf(t) = log(N / n_t), with N documents in the
collection, n_t of them containing term t, and a base-10 logarithm in the values above.
Note: Doc Length, Term Location, Term Semantic Meaning
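The weights in the table can be recomputed with a short sketch. It assumes a base-10 logarithm and raw term counts as tf, which is what the listed values imply; the function names are made up for the example.

```python
import math

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}

# Alphabetical vocabulary: borneo, daily, new, north, straits, times.
vocab = sorted({term for text in docs.values() for term in text.split()})


def idf(term):
    """log10(N / n_t); only defined for terms that occur in the collection."""
    containing = sum(term in text.split() for text in docs.values())
    return math.log10(len(docs) / containing)


def tfidf_vector(text):
    """Weight of each vocabulary term: raw count in the text times idf."""
    tokens = text.split()
    return [tokens.count(term) * idf(term) for term in vocab]


for doc_id, text in docs.items():
    print(doc_id, [round(w, 3) for w in tfidf_vector(text)])
```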
36
Retrieval Function (Vector Space Model)
Documents ranked by distance between points
representing query and documents
Similarity measure more common than a distance or dissimilarity
measure
e.g. Cosine correlation
37
Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against collection, D1 = “new straits times”
(borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
$$\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)\,(0.176^2 + 0.176^2)}} = 0.816$$
Find Cosine(D2, Q).
Which document is
more relevant?
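The calculation can be checked with a few lines of code, using the query and document vectors exactly as listed on the slide. The helper name `cosine` is chosen for this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Term order: borneo, daily, new, north, straits, times
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)

print(round(cosine(D1, Q), 3))
print(round(cosine(D2, Q), 3))
```

D2 shares only "straits" with the query and also carries the heavily weighted term "daily", so its cosine comes out well below that of D1.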
38
Evaluation
Evaluating the retrieval function, preprocessing
steps etc. is a must.
Standard Collection
Task specific
Human experts are used to judge relevant results.
Performance Metric
Precision
Recall
39
Evaluation (Collection)
Test collections consisting of documents, queries, and relevance
judgments, e.g.,
40
Evaluation (Collection)
Example query and narrative for the gold standard.
41
Evaluation (Effectiveness Measures)
A is the set of relevant documents,
B is the set of retrieved documents
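The formulas from the slide figure are not in this transcript. With A and B as defined above, the standard definitions are recall = |A ∩ B| / |A| and precision = |A ∩ B| / |B|. The sketch below implements them, plus a variant that looks only at the top k results of a ranking. The document ids and the ranking are hypothetical.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant: |A ∩ B| / |B|."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved: |A ∩ B| / |A|."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


def precision_at_k(relevant, ranking, k):
    """Precision computed over the top k results of a ranked list."""
    return precision(relevant, set(ranking[:k]))


def recall_at_k(relevant, ranking, k):
    """Recall computed over the top k results of a ranked list."""
    return recall(relevant, set(ranking[:k]))


# Hypothetical judgments: four relevant documents, three of them in the top 4.
relevant = {"d1", "d2", "d4", "d9"}
ranking = ["d1", "d2", "d3", "d4", "d5"]

print(precision_at_k(relevant, ranking, 4), recall_at_k(relevant, ranking, 4))
print(precision_at_k(relevant, ranking, 2), recall_at_k(relevant, ranking, 2))
```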
42
Evaluation (Ranking Effectiveness)
43
Evaluation (Ranking Effectiveness)
Recall@4 = 3/4
Precision@4 = 3/4
Recall@2 = 2/4
Precision@2 = 2/2
44
Challenges
Social Texts, e.g. Tweets, Posts
Hard question. Hard Disk?
Named Entity
Various levels and aspects of annotations
45
Challenges
Small Data
Specific search
Improve semantics extensively
Big Data
Multimodal retrieval
Connecting many media
46
Case: Adding Semantics Bibliography
47
Improve Search Results Display
Facet-based semantic
Useful Terms
Demo: ir.cs.usm.my
48
THANK YOU
khgan@usm.my
49
