Concepts and Challenges of Text Retrieval for Search Engine

CONCEPTS AND CHALLENGES
OF TEXT RETRIEVAL
FOR SEARCH ENGINES
PRE-CONFERENCE TUTORIAL
by Gan Keng Hoon
16th August 2016

THIS TUTORIAL
Overview: Text Retrieval & Search Engine
Concept: Basics of Text Retrieval
Challenges: Semantics & Specific
Case: Expert Search Engine

What Do People Search For?
Fu Yuanhui
How to get free Pokeball?
How to write thesis in three months?
keynote speaker ICAICTA 2016

What Do People Expect?
How to get free Pokeball

Quiz: Which one is not a Search Engine?

Types of Search Engine
Web Search Engine: Google, Yahoo, Bing
Domain Specific Search Engine: Medline/PubMed, Microsoft Academic
Desktop Search Engine: Copernic

Connecting Two Ends
Search Collection
Types: Web, Domain Specific, Personal, Enterprise, etc.
Contents: Web Sites, Journal Articles, News, Images, Videos, Audio, Scanned Documents, Tweets, Posts, Reviews, etc.
Information Needs
“I want to know more about the keynote speeches of ICAICTA 2016.”
“I need more Pokeballs, free of charge...”
“What’s so funny about Fu Yuanhui??”
“Scholarship ending soon, three months left to submit my thesis...”

A Conceptual Model for Text Retrieval
Components: Information Needs, Query, Search Collection, Document Representation, Retrieved Documents
Processes: Formulation, Indexing, Natural Language Content Analysis, Retrieval Function, Relevance Feedback

Natural Language Content Analysis

Search Collection (Retrieval Unit)
Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.
A retrieval unit can be:
Part of a document, e.g. a paragraph, a slide, a page.
In different structures, e.g. HTML, XML, plain text.
Of different sizes or lengths.

Document Representation
Full-Text Representation
Keep everything. Complete.
Requires huge resources. Too much may not be good.
Reduced (partial) Content Representation
Remove unimportant content, e.g. stopwords.
Standardize to reduce overlapping content, e.g. stemming.
Retain only important content, e.g. noun phrases, headers.

Document Representation
Think of representation as a way of storing the document.
Bag of Words Model
Store the document as the bag (multiset) of its words, disregarding grammar and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
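The bag-of-words example above can be reproduced with a minimal Python sketch. It assumes lowercasing and whitespace splitting as the only preprocessing; the function name `bag_of_words` is illustrative and not part of the tutorial.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of text documents."""
    # Assumption: lowercase and split on whitespace; no other preprocessing.
    tokenized = [doc.lower().split() for doc in docs]

    # Build the word list in order of first appearance across the documents.
    vocab = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocab:
                vocab.append(token)

    # One count vector per document, aligned with the word list.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocab, vectors = bag_of_words(docs)
print(vocab)
for vector in vectors:
    print(vector)
```

Word order inside each document is discarded; only the counts per word survive.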

Information Needs & Query
Information Needs != Query
Recall the information needs:
Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speeches of ICAICTA 2016.
Query: free pokeball
Information Need: I need more Pokeballs. I don’t want to pay. No cheat codes.

Retrieved Documents
From the original collection, a subset of documents is obtained.
What determines which documents are returned?
Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and the query, based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. Display the output to the user as a ranked list; the top documents are the most relevant as judged by the system.
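The four steps above can be sketched as follows. This is only one possible reading: "similarity" is assumed here to be the number of distinct query terms a document contains, and the three documents are made up for illustration.

```python
def rank_by_term_overlap(query, docs):
    """Rank documents by how many distinct query terms they contain."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        doc_terms = set(text.lower().split())
        # Steps 1-2: compare terms and count the ones in common.
        score = len(query_terms & doc_terms)
        scored.append((doc_id, score))
    # Step 3: sort by decreasing similarity (ties broken by document id).
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical collection, not taken from the tutorial.
docs = {
    "d1": "keynote speakers announced for icaicta 2016",
    "d2": "how to catch pokemon with a free pokeball",
    "d3": "icaicta 2016 call for papers",
}

# Step 4: the ranked list shown to the user.
ranking = rank_by_term_overlap("icaicta 2016 keynote", docs)
for doc_id, score in ranking:
    print(doc_id, score)
```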

Indexing
Convert documents into a representation or data structure that improves the efficiency of retrieval.
Generate a set of useful terms, called indexes.
Why?
Texts use a wide variety of words, but not all are important.
Among the important words, some are more contextually relevant.
Some basic processes involved:
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File

Indexing (Tokenization)
Convert a sequence of characters into a sequence of tokens with some basic meaning.
“The cat chases the mouse.”
Tokens: the, cat, chases, the, mouse
“Bigcorp's 2007 bi-annual report showed profits rose 10%.”
Tokens: bigcorp, 2007, bi, annual, report, showed, profits, rose, 10%
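A minimal tokenizer sketch that reproduces the two examples above. The rules are assumptions chosen to match those outputs (lowercase, drop possessive 's, split on anything that is not a letter or digit, keep a trailing %); real tokenizers handle many more cases.

```python
import re


def tokenize(text):
    """Split text into lowercase tokens (a simplified, assumed rule set)."""
    text = text.lower()
    # Drop the possessive 's (straight or curly apostrophe).
    text = re.sub(r"['’]s\b", "", text)
    # A token is a run of letters/digits, optionally followed by a percent sign.
    return re.findall(r"[a-z0-9]+%?", text)


print(tokenize("The cat chases the mouse."))
print(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
```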

A token can be a single term or multiple terms.
“Samsung Galaxy S7 Edge, redefines what a phone can do.”
Tokens: samsung galaxy s7 edge, redefines, what, a, phone, can, do
or
Tokens: samsung, galaxy, s7, edge, redefines, what, a ….

Common Issues
1. Capitalized words can have different meanings from lowercase words
Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

3. Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for documents

Indexing (Stopping)
Recall: indexes should be useful terms that link to a document.
Are the terms in the figure useful?
Figure: Top 50 Words from AP89 News Collection

Indexing (Stopping)
A stopword list can be created from high-frequency words or based on a standard list
Lists are customized for applications, domains, and even parts of documents
e.g., “click” is a good stopword for anchor text
The best policy is to index all words in documents and decide which words to use at query time
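A small sketch of the high-frequency approach, applied at query time as the slide recommends. The cut-off `top_k` and both function names are illustrative assumptions; a real list would be tuned per application or taken from a standard list.

```python
from collections import Counter


def build_stoplist(docs, top_k):
    """Take the top_k most frequent words in the collection as stopwords."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return {word for word, _ in counts.most_common(top_k)}


def remove_stopwords(tokens, stoplist):
    """Filter stopwords out of a token list (e.g. the query terms)."""
    return [token for token in tokens if token not in stoplist]


docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]
stoplist = build_stoplist(docs, top_k=1)

# The index keeps every word; the stoplist is only applied to the query.
query_terms = remove_stopwords("the cat".split(), stoplist)
print(stoplist, query_terms)
```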

Indexing (Stemming)
Many morphological variations of words
inflectional (plurals, tenses)
derivational (e.g. making verbs into nouns)
In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words to a common stem
usually involves removing suffixes
Can be done at indexing time or as part of query processing (like stopwords)

Indexing (Stemming)
Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to strip the longest possible suffix at each step
Produces stems, not words
Example: Step 1 (figure)
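As an illustration of "longest possible suffix first", here is a sketch of step 1a only, using the plural rules of the original Porter algorithm (sses → ss, ies → i, ss → ss, s → deleted). The figure on the slide is not reproduced here and may show a different variant of the rules; this is not a full stemmer.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter stemmer (plural suffixes only)."""
    # Rules are tried from the longest suffix to the shortest.
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]  # s -> (deleted)
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", porter_step_1a(word))
```

Note that "ponies" becomes "poni", which is a stem rather than a word.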

Indexing (Phrases)
Recall: meaningful tokens, e.g. phrases, make better indexes.
Text processing issue – how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries
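Of the three approaches, word n-grams are the simplest to sketch. The helper name `word_ngrams` is illustrative; it returns every run of n consecutive tokens as a candidate phrase.

```python
def word_ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "samsung galaxy s7 edge".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams)
```

Most n-grams are not meaningful phrases, so in practice they are usually filtered, for example by frequency.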

Indexing (Phrases)
Example: Noun Phrases
* Other methods include n-grams

Indexing (Inverted Index)
Recall: indexes are designed to support search.
Each index term is associated with an inverted list
Contains lists of documents, or lists of word occurrences in documents, and other information.
Each entry is called a posting.
The part of the posting that refers to a specific document or location is called a pointer
Each document in the collection is given a unique number
Lists are usually document-ordered (sorted by document number)
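A minimal sketch of a simple inverted index, assuming whitespace tokenization. Each term maps to a document-ordered list of document numbers; the three short documents are the ones used in the vector space example of this tutorial.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a document-ordered list of document numbers."""
    index = defaultdict(list)
    # Visiting documents in increasing number keeps every list document-ordered.
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)  # one posting per (term, document)
    return dict(index)


docs = {1: "new straits times", 2: "new straits daily", 3: "north borneo times"}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```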

Sample collection: four sentences from the Wikipedia entry for Tropical Fish

Simple inverted index.

Inverted index with counts.
Supports better ranking algorithms.

Indexing (Inverted Index)
Inverted index with positions.
Supports proximity matching.
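A sketch of how stored positions can support proximity matching. The two sentences are invented for illustration (they are not the Wikipedia sample), and the "within k positions" test is one assumed form of a proximity operator.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document number: [word positions]}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def within_window(index, term_a, term_b, k):
    """Return documents where the two terms occur within k positions."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical documents for illustration.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish are kept in tanks in tropical countries",
}
index = build_positional_index(docs)
print(within_window(index, "tropical", "fish", 1))
print(within_window(index, "tropical", "fish", 6))
```

The length of each position list is the term count for that document, so the same structure also covers the index with counts.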

Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm

Retrieval Function (Vector Space Model)
Rank-based method.
Documents and queries are represented by vectors of term weights.
The collection is represented by a matrix of term weights.

D1: new straits times
D2: new straits daily
D3: north borneo times
Vector of useful terms:
      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1

tf.idf weight
Term frequency (tf) weight measures importance in a document.
Inverse document frequency (idf) measures importance in the collection: idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t (base-10 logarithm in this example).
idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176
Then multiply by tf:
      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176
Note: Doc Length, Term Location, Term Semantic Meaning
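The weights above can be recomputed with a short sketch. It assumes raw term counts for tf and a base-10 logarithm for idf, which is what the worked values imply; the function name `tf_idf` is illustrative.

```python
import math


def tf_idf(docs):
    """Return (vocabulary, idf per term, tf.idf vector per document)."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    vocab = sorted({term for tokens in tokenized.values() for term in tokens})
    n_docs = len(docs)

    idf = {}
    for term in vocab:
        # Document frequency: number of documents containing the term.
        df = sum(1 for tokens in tokenized.values() if term in tokens)
        idf[term] = math.log10(n_docs / df)

    # Assumption: tf is the raw count of the term in the document.
    weights = {
        doc_id: [tokens.count(term) * idf[term] for term in vocab]
        for doc_id, tokens in tokenized.items()
    }
    return vocab, idf, weights


docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
vocab, idf, weights = tf_idf(docs)
print(vocab)
for doc_id, vector in weights.items():
    print(doc_id, [round(w, 3) for w in vector])
```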

Documents are ranked by the distance between the points representing the query and the documents
A similarity measure is more common than a distance or dissimilarity measure
e.g. Cosine correlation

Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against collection, D1 = “new straits times”
(borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
$$
\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)(0.176^2 + 0.176^2)}} = 0.816
$$
Find Cosine(D2, Q).
Which document is more relevant?
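The cosine computation can be checked with a few lines of Python, using the vectors exactly as given on the slide. The function name `cosine` is illustrative, and returning 0 for an all-zero vector is an assumed convention.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_u * norm_v)


# Term order: borneo, daily, new, north, straits, times
Q = [0, 0, 0, 0, 0.176, 0.176]
D1 = [0, 0, 0.176, 0, 0.176, 0.176]
D2 = [0, 0.477, 0.176, 0, 0.176, 0]

print(round(cosine(D1, Q), 3))  # 0.816, as on the slide
print(round(cosine(D2, Q), 3))  # run this to answer the exercise
```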

Evaluation
It is a must to evaluate the retrieval function, preprocessing steps, etc.
Standard Collection
Task specific
Human experts judge which results are relevant.
Performance Metric
Precision
Recall

Evaluation (Collection)
Test collections consist of documents, queries, and relevance judgments.

Evaluation (Collection)
Example query and narrative for the gold standard.

Evaluation (Effectiveness Measures)
A is the set of relevant documents;
B is the set of retrieved documents
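With A and B defined as above, the standard set-based definitions are Recall = |A ∩ B| / |A| and Precision = |A ∩ B| / |B|. A minimal sketch follows; the document ids are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a set of retrieved documents."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |A ∩ B|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


A = {"d1", "d2", "d3", "d4"}  # relevant (hypothetical)
B = {"d1", "d2", "d4", "d7"}  # retrieved (hypothetical)
precision, recall = precision_recall(A, B)
print(precision, recall)
```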

Evaluation (Ranking Effectiveness)

Evaluation (Ranking Effectiveness)
Recall@4 = 3/4
Precision@4 = 3/4
Recall@2 = 2/4
Precision@2 = 2/2
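The cut-off measures above can be sketched as follows. The ranking below is hypothetical, chosen only so that it is consistent with the four values listed (four relevant documents in total, three of them in the top four); the original figure is not reproduced.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)


# Hypothetical ranked list and relevance judgments.
ranking = ["d1", "d2", "d9", "d3", "d4", "d8"]
relevant = {"d1", "d2", "d3", "d4"}

for k in (2, 4):
    print(k, precision_at_k(ranking, relevant, k), recall_at_k(ranking, relevant, k))
```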

Challenges
Social Texts, e.g. Tweets, Posts
Hard question. Hard Disk?
Named Entity
Various levels and aspects of annotations

Challenges
Small Data
Specific search
Improve semantics extensively
Big Data
Multimodal retrieval
Connecting many media

Case: Adding Semantics Bibliography

Improve Search Results Display
Facet-based
semantic
Useful Terms
Demo: ir.cs.usm.my
