3. Topics
– Algorithmic issues in classic information retrieval (IR), e.g. indexing, stemming, etc.
– Issues related to Web IR
4. Information Retrieval
Information retrieval (IR) deals with the organization, storage, retrieval and evaluation of information relevant to a user's query.
A user having an information need formulates a request in the form of a query written in natural language. The retrieval system responds by retrieving documents that seem relevant to the query.
5. Information Retrieval
Traditionally it has been accepted that an information retrieval system does not return the actual information but the documents containing that information. As Lancaster pointed out:
'An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.'
6. A question answering system provides the user with answers to specific questions.
Data retrieval systems retrieve precise data.
An information retrieval system does not search for specific data as in a data retrieval system, nor does it search for direct answers to a question as in a question answering system.
8. Classic IR
Input: Document collection
Goal: Retrieve documents or text with information content that is relevant to the user's information need
Two aspects:
1. Processing the collection
2. Processing queries (searching)
9. Information Retrieval Model
An IR model is a pattern that defines several aspects of the retrieval procedure, for example:
- how the documents and users' queries are represented,
- how the system retrieves relevant documents according to users' queries, and
- how retrieved documents are ranked.
10. IR Model
An IR model consists of
- a model for documents,
- a model for queries, and
- a matching function which compares queries to documents.
11. Classical IR Model
IR models can be classified as:
- Classical models of IR
- Non-classical models of IR
- Alternative models of IR
12. Classical IR Model
Based on mathematical knowledge that is easily recognized and well understood; simple, efficient and easy to implement.
The three classical information retrieval models are:
- Boolean,
- Vector, and
- Probabilistic models.
13. Non-Classical Models of IR
Non-classical information retrieval models are based on principles other than similarity, probability, Boolean operations, etc., on which classical retrieval models are based.
Examples: information logic model, situation theory model and interaction model.
14. Alternative IR Models
Alternative models are enhancements of classical models making use of specific techniques from other fields.
Examples: cluster model, fuzzy model and latent semantic indexing (LSI) models.
15. Information Retrieval Model
The actual text of the document and query is not used in the retrieval process; instead, some representation of it is used.
The document representation is matched with the query representation to perform retrieval.
One frequently used method is to represent a document as a set of index terms or keywords.
16. Indexing
The process of transforming document text into some representation of it is known as indexing.
Different index structures may be used. One data structure commonly used by IR systems is the inverted index.
17. Inverted Index
An inverted index is simply a list of keywords (tokens/terms), with each keyword having pointers to the documents containing that keyword.
- A document is tokenized; a nonempty sequence of characters not including spaces or punctuation can be regarded as a token.
18. Inverted Index
- Each distinct token in the collection may be represented as a distinct integer.
- A document is thus transformed into a sequence of integers.
19. Example: Inverted Index
D1: The weather is cool.
D2: She has a cool mind.
The inverted index can be represented as a table called POSTING:

tid (term)    did  pos
1 (The)       1    1
2 (weather)   1    2
3 (is)        1    3
4 (cool)      1    4
5 (She)       2    1
6 (has)       2    2
7 (a)         2    3
4 (cool)      2    4
8 (mind)      2    5
20. Example: Inverted Index
Without positions:
The      d1
weather  d1
is       d1
cool     d1, d2
She      d2
has      d2
a        d2
mind     d2
With positions:
The      d1/1
weather  d1/2
is       d1/3
cool     d1/4, d2/4
She      d2/1
has      d2/2
a        d2/3
mind     d2/5
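The listings above can be generated with a few lines of code. The following is a minimal sketch (not part of the original tutorial) that builds a positional inverted index for D1 and D2; the tokenizer simply lowercases and splits on non-alphanumeric characters, which is an assumption rather than the slides' exact tokenization.

import re
from collections import defaultdict

def tokenize(text):
    # A nonempty sequence of characters without spaces/punctuation is a token (lowercased here).
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    # term -> list of (doc_id, position) postings
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].append((doc_id, pos))
    return index

docs = {"d1": "The weather is cool.", "d2": "She has a cool mind."}
index = build_inverted_index(docs)
print(index["cool"])   # [('d1', 4), ('d2', 4)]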
21. The computational cost involved in adopting a full-text logical view (i.e. using the full set of words to represent a document) is high.
Hence, some text operations are usually performed to reduce the set of representative keywords.
22. The two most commonly used text operations are:
1. Stop word elimination, and
2. Stemming
Zipf's law
23. Stop word elimination involves removal of grammatical or function words, while stemming reduces distinct words to their common grammatical root.
24. Indexing
Most indexing techniques involve identifying good document descriptors, such as keywords or terms, to describe the information content of the documents.
A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection.
25. A term can be a single word or a multi-word phrase.
Example:
"Design Features of Information Retrieval systems"
can be represented by the set of terms:
{Design, Features, Information, Retrieval, systems}
or by the set of terms:
{Design, Features, Information Retrieval, Information Retrieval systems}
26. Luhn's Early Assumption
Luhn assumed that the frequency of word occurrence in an article gives a meaningful identification of its content.
The discrimination power of index terms is a function of the rank order of their frequency of occurrence.
27. Stop Word Elimination
Stop words are high-frequency words which have little semantic weight and are thus unlikely to help with retrieval.
Such words are commonly used in documents regardless of topic, and have no topical specificity.
29. Stop Word Elimination
Advantage:
Eliminating these words can result in a considerable reduction in the number of index terms without losing any significant information.
Disadvantage:
It can sometimes result in the elimination of terms useful for searching, for instance the stop word A in "Vitamin A".
Some phrases, like "to be or not to be", consist entirely of stop words.
31. Stemming
Stemming normalizes morphological variants.
It removes suffixes from words to reduce them to some root form, e.g. the words compute, computing, computes and computer will all be reduced to the same word stem, comput.
32. Stemming
Porter Stemmer (1980).
Example:
The stemmed representation of
"Design Features of Information Retrieval systems"
will be
{design, featur, inform, retriev, system}
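A quick way to reproduce the example above is with an off-the-shelf Porter stemmer. The sketch below assumes the NLTK library is installed; a hand-rolled suffix stripper would be used the same way.

# Sketch: stemming with an off-the-shelf Porter stemmer (assumes NLTK is available).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
terms = ["Design", "Features", "of", "Information", "Retrieval", "systems"]
print({stemmer.stem(t) for t in terms})
# Expected terms (order may vary): design, featur, of, inform, retriev, system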
33. Stemming
Stemming throws away useful distinctions.
In some cases it may be useful, helping to conflate similar terms and resulting in increased recall; in others it may be harmful, resulting in reduced precision.
34. Zipf's Law
Zipf's law: the frequency of a word multiplied by its rank in a large corpus is approximately constant, i.e.

frequency × rank ≈ constant,  or  frequency ≈ constant / rank
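As an illustration (my own sketch, not from the original slides), the code below counts word frequencies in any plain-text file and prints rank × frequency for the top words; on a large corpus the products stay in the same ballpark. The file name is hypothetical.

# Sketch: checking Zipf's law (rank * frequency ~ constant) on a plain-text corpus.
import re
from collections import Counter

def zipf_table(path, top=10):
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    for rank, (word, freq) in enumerate(Counter(words).most_common(top), start=1):
        print(f"{rank:>4} {word:<15} freq={freq:<8} rank*freq={rank * freq}")

# zipf_table("corpus.txt")   # hypothetical corpus file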
36. Luhn's Assumptions
Luhn (1958) attempted to quantify the discriminating power of terms by associating it with their frequency of occurrence (term frequency) within the document. He postulated that:
- high-frequency words are quite common (function words),
- low-frequency words are rare words, and
- medium-frequency words are useful for indexing.
37. Boolean Model
The oldest of the three classical models.
It is based on Boolean logic and classical set theory.
It represents documents as sets of keywords, usually stored in an inverted file.
38. Boolean Model
Users are required to express their queries as Boolean expressions consisting of keywords connected with Boolean logical operators (AND, OR, NOT).
Retrieval is performed based on whether or not a document contains the query terms.
39. Boolean Model
Given a finite set
  T = {t1, t2, ..., ti, ..., tm}
of index terms, a finite set
  D = {d1, d2, ..., dj, ..., dn}
of documents, and a Boolean expression in normal form representing a query Q as follows:

  Q = ∧ (∨ θi),   θi ∈ {ti, ¬ti}
40. Boolean Model
1. The set Ri of documents that do (or do not) contain the term ti is obtained:

  Ri = { dj | θi ∈ dj },   θi ∈ {ti, ¬ti}

  where ¬ti ∈ dj means ti ∉ dj.

2. Set operations are used to retrieve documents in response to Q.
41. Example: Boolean Model
Let the set of original documents be
D = {D1, D2, D3}
where
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.
42. Example: Boolean Model
Let the set of terms used to represent these documents be:
T = {information, retrieval, query}
Then the set D of documents will be represented as follows:
D = {d1, d2, d3}
where
d1 = {information, retrieval, query}
d2 = {information, query}
d3 = {retrieval, query}
43. Example (Contd.)
Let the query Q be:

  Q = information ∧ retrieval

First, the following sets S1 and S2 of documents are retrieved in response to Q:

  S1 = { dj | information ∈ dj } = {d1, d2}
  S2 = { dj | retrieval ∈ dj }  = {d1, d3}

Then, the following documents are retrieved in response to query Q:

  { dj | dj ∈ S1 ∩ S2 } = {d1}
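This example can be reproduced with plain Python sets; the following is a minimal sketch of Boolean (AND only) retrieval over the term sets d1–d3, not a full Boolean query parser.

# Sketch: Boolean (AND) retrieval over term-set document representations.
docs = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"information", "query"},
    "d3": {"retrieval", "query"},
}

def boolean_and(*terms):
    # Intersect the sets of documents containing each query term.
    result = None
    for term in terms:
        matching = {d for d, terms_in_d in docs.items() if term in terms_in_d}
        result = matching if result is None else result & matching
    return result

print(boolean_and("information", "retrieval"))   # {'d1'}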
44. Boolean Model
The model is not able to retrieve documents that are only partly relevant to the user query.
It is unable to produce a ranked list of documents.
Users seldom compose their queries as the pure Boolean expressions that this model requires.
45. Extended Boolean Model
To overcome the weaknesses of the Boolean model, numerous extensions have been suggested that do provide a ranked list of documents.
Discussion of these extensions is beyond the scope of this tutorial.
Ref.: P-norm model (Salton et al., 1983) and Paice model (Paice, 1984).
46. Vector Space Model
It represents documents and queries as vectors of features representing terms.
Features are assigned numerical values that are usually some function of the frequency of terms.
A ranking algorithm computes the similarity between document and query vectors to yield a retrieval score for each document.
47. Vector Space Model
Given a finite set of n documents:
  D = {d1, d2, ..., dj, ..., dn}
and a finite set of m terms:
  T = {t1, t2, ..., ti, ..., tm}
each document is represented by a column vector of weights as follows:

  (w1j, w2j, w3j, ..., wij, ..., wmj)^t

where wij is the weight of term ti in document dj.
48. Vector Space Model
The document collection as a whole is represented by an m × n term–document matrix:

  | w11  w12  ...  w1j  ...  w1n |
  | w21  w22  ...  w2j  ...  w2n |
  | ...                          |
  | wi1  wi2  ...  wij  ...  win |
  | ...                          |
  | wm1  wm2  ...  wmj  ...  wmn |
49. Example: Vector Space Model
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.
50. Example: Vector Space Model
Let the weights be assigned based on the frequency of the term within the document.
The term–document matrix (rows: information, retrieval, query; columns: D1, D2, D3) is:

  | 2  1  0 |
  | 2  0  1 |
  | 1  1  1 |
51. Vector Space Model
The raw term frequency approach gives too much importance to the absolute values of the various coordinates of each document.
52. Consider two document vectors:
  (2, 2, 1)^t
  (4, 4, 2)^t
The documents look similar except for the difference in magnitude of the term weights.
53. Normalizing Term Weights
To reduce the importance of the length of document vectors, we normalize document vectors.
Normalization changes all vectors to a standard length.
We can convert document vectors to unit length by dividing each dimension by the overall length of the vector.
54. Normalizing the term–document matrix:

  | 2  1  0 |
  | 2  0  1 |
  | 1  1  1 |

Elements of each column are divided by the length of the column vector, sqrt(Σi wij²). We get:

  | 0.67  0.71  0    |
  | 0.67  0     0.71 |
  | 0.33  0.71  0.71 |
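A quick way to verify the normalized matrix above is the following sketch (my own code, standard library only):

# Sketch: column-wise (per-document) length normalization of a term-document matrix.
from math import sqrt

# Rows = terms (information, retrieval, query); columns = D1, D2, D3.
W = [[2, 1, 0],
     [2, 0, 1],
     [1, 1, 1]]

n_terms, n_docs = len(W), len(W[0])
lengths = [sqrt(sum(W[i][j] ** 2 for i in range(n_terms))) for j in range(n_docs)]
W_norm = [[round(W[i][j] / lengths[j], 2) for j in range(n_docs)] for i in range(n_terms)]
for row in W_norm:
    print(row)
# [0.67, 0.71, 0.0]
# [0.67, 0.0, 0.71]
# [0.33, 0.71, 0.71]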
55. Term Weighting
Luhn's postulate can be refined by noting that:
1. The more a document contains a given word, the more that document is about the concept represented by that word.
2. The fewer documents in a collection a term occurs in, the more discriminating that term is.
56. Term Weighting
The first factor simply means that terms occurring more frequently in a document represent its meaning more strongly than those occurring less frequently.
The second factor considers the distribution of a term across the document collection.
57. Term Weighting
A measure that favors terms appearing in fewer documents is required.
The fraction n/ni gives exactly this measure, where
  n is the total number of documents in the collection, and
  ni is the number of documents in which term i occurs.
58. Term Weighting
As the number of documents in any collection is usually large, the log of this measure is usually taken, resulting in the following form of the inverse document frequency (idf) term weight:

  idfi = log(n / ni)
59. Tf-idf Weighting Scheme
tf  - a document-specific statistic
idf - a global statistic that attempts to capture the distribution of a term across the document collection
60. Tf-idf Weighting Scheme
The term frequency (tf) component is a document-specific statistic that measures the importance of a term within the document.
The inverse document frequency (idf) is a global statistic that attempts to capture the distribution of a term across the document collection.
62. Normalizing tf and idf Factors
tf can be normalized by dividing the term frequency by the frequency of the most frequent term in the document.
idf can be normalized by dividing it by the logarithm of the collection size (n).
63. Term Weighting Schemes
A third factor that may affect the weighting function is the document length.
Weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency
2. Collection frequency or inverse document frequency
3. Document length
64. A term weighting scheme can be represented by a triple ABC, where
A - tf component
B - idf component
C - length normalization component
65. Different combinations of options can be used to represent document and query vectors.
The retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc = "nnn", query = "nnn").
66. Options for the Three Weighting Factors
Term frequency within document (A):
  n   tf = tfij                                raw term frequency
  b   tf = 0 or 1                              binary weight
  a   tf = 0.5 + 0.5 * (tfij / max tf in Dj)   augmented term frequency
  l   tf = ln(tfij) + 1.0                      logarithmic term frequency
Inverse document frequency (B):
  n   wt = tf                                  no conversion
  t   wt = tf * idf                            multiply tf with idf
67. Options for the Three Weighting Factors
Document length (C):
  n   wij = wt                                 no conversion
  c   wij = wt / sqrt(sum of (wt squared))     cosine normalization
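Putting the A/B/C options together, here is a minimal sketch (my own illustration, not the tutorial's code) of an "ntc"-style weighting: raw term frequency, tf * idf with idf = log(n/ni), and cosine length normalization.

# Sketch: "ntc"-style term weighting (raw tf, tf*idf, cosine normalization).
from math import log, sqrt
from collections import Counter

def ntc_weights(doc_tokens, all_docs_tokens):
    n = len(all_docs_tokens)
    tf = Counter(doc_tokens)                                   # A = n: raw term frequency
    df = {t: sum(1 for d in all_docs_tokens if t in d) for t in tf}
    wt = {t: tf[t] * log(n / df[t]) for t in tf}               # B = t: tf * idf, idf = log(n/ni)
    norm = sqrt(sum(w * w for w in wt.values())) or 1.0        # C = c: cosine normalization
    return {t: w / norm for t, w in wt.items()}

docs = [["information", "retrieval", "query"],
        ["information", "query"],
        ["retrieval", "query"]]
print(ntc_weights(docs[0], docs))   # note: a term occurring in every document gets idf = 0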
68. Indexing Algorithm
Step 1. Tokenization: extracts individual terms (words) from the document, converts all words to lower case and removes punctuation marks. The output of this stage is a representation of the document as a stream of terms.
Step 2. Stop word elimination: removes words that appear very frequently across the document collection.
Step 3. Stemming: reduces the remaining terms to their linguistic root, to get the index terms.
Step 4. Term weighting: assigns weights to terms according to their importance in the document, in the collection, or some combination of both.
70. Similarity Measures
Inner product:

  sim(dj, qk) = Σ(i=1..m) wij · wik

Cosine similarity:

  sim(dj, qk) = ( Σ(i=1..m) wij · wik ) / ( sqrt(Σ(i=1..m) wij²) · sqrt(Σ(i=1..m) wik²) )
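For illustration, a small sketch of both measures over weight vectors (my own code; the document and query vectors below are illustrative values, with d1 taken from the normalized example matrix):

# Sketch: inner-product and cosine similarity between a document and a query vector.
from math import sqrt

def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine(d, q):
    denom = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return inner_product(d, q) / denom if denom else 0.0

d1 = [0.67, 0.67, 0.33]             # normalized D1 from the earlier example
query = [1, 1, 0]                   # query containing "information" and "retrieval"
print(inner_product(d1, query))     # 1.34
print(round(cosine(d1, query), 2))  # ~0.94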
71. Evaluation of IR Systems
The evaluation of an IR system is the process of assessing how well a system meets the information needs of its users (Voorhees, 2001).
Criteria for evaluation:
- Coverage of the collection
- Time lag
- Presentation format
- User effort
- Precision
- Recall
72. Evaluation of IR Systems
IR evaluation models can be broadly classified as system-driven models and user-centered models.
System-driven models focus on measuring how well the system can rank documents.
User-centered evaluation models attempt to measure the user's satisfaction with the system.
73. Why System Evaluation?
There are many retrieval models/algorithms/systems; which one is the best?
What is the best component for:
• Ranking function (dot-product, cosine, ...)
• Term selection (stop word removal, stemming, ...)
• Term weighting (TF, TF-IDF, ...)
How far down the ranked list will a user need to look to find some/all relevant documents?
74. Evaluation of IR Systems
The traditional goal of IR is to retrieve all and only the relevant documents in response to a query.
"All" is measured by recall: the proportion of relevant documents in the collection which are retrieved.
"Only" is measured by precision: the proportion of retrieved documents which are relevant.
75. Precision vs. Recall
[Diagram: within the set of all docs, the Relevant set and the Retrieved set overlap; their intersection is RelRetrieved.]

  Recall    = |RelRetrieved| / |Relevant|
  Precision = |RelRetrieved| / |Retrieved|
76. Trade-off between Recall and Precision
[Plot of precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). The ideal is the top-right corner: precision 1 at recall 1. One extreme returns only relevant documents but misses many useful ones (high precision, low recall); the other returns most relevant documents but includes lots of junk (high recall, low precision).]
77. Test Collection Approach
The total number of relevant documents in a collection must be known in order for recall to be calculated.
To provide a framework for the evaluation of IR systems, a number of test collections have been developed (Cranfield, TREC, etc.).
These document collections are accompanied by a set of queries and relevance judgments.
78. IR Test Collections
Collection   Number of documents   Number of queries
Cranfield    1,400                 225
CACM         3,204                 64
CISI         1,460                 112
LISA         6,004                 35
TIME         423                   83
ADI          82                    35
MEDLINE      1,033                 30
TREC-1       742,611               100
79. Fixed Recall Levels
One way to evaluate is to look at average precision at fixed recall levels.
• Provides the information needed for precision/recall graphs.
80. Document Cutoff Levels
Another way to evaluate:
• Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100
• Measure precision at each of these levels
• Take the (weighted) average over results
This focuses on how well the system ranks the first k documents.
81. Computing Recall/Precision Points
For a given query, produce the ranked list of retrievals.
Mark each document in the ranked list that is relevant according to the gold standard.
Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
82. Computing Recall/Precision Points: An Example
n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

Let the total number of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
Missing one relevant document: we never reach 100% recall.
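These recall/precision points are easy to compute programmatically. A minimal sketch (my own illustration) that reproduces the numbers above; the sixth relevant document is never retrieved, so a placeholder id stands in for it:

# Sketch: recall/precision at each relevant document in a ranked list.
ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772, 111}   # 6 relevant docs; 111 is a placeholder never retrieved

found = 0
for rank, doc in enumerate(ranked, start=1):
    if doc in relevant:
        found += 1
        recall = found / len(relevant)
        precision = found / rank
        print(f"R={recall:.3f}  P={precision:.3f}")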
83. Interpolating a Recall/Precision Curve
Interpolate a precision value for each standard recall level:
• rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
• r0 = 0.0, r1 = 0.1, ..., r10 = 1.0
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level greater than or equal to the j-th level.
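A short sketch of this interpolation rule (assuming we already have the (recall, precision) points from the previous example):

# Sketch: interpolated precision at the 11 standard recall levels.
points = [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.385)]

def interpolate(points, levels=None):
    levels = levels or [i / 10 for i in range(11)]
    interpolated = []
    for r in levels:
        # Maximum known precision at any recall >= this standard level (0 if none).
        candidates = [p for rec, p in points if rec >= r]
        interpolated.append((r, max(candidates) if candidates else 0.0))
    return interpolated

for r, p in interpolate(points):
    print(f"recall {r:.1f}: precision {p:.3f}")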
86. Average Recall/Precision Curve
Compute average precision at each standard recall level across all queries.
Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
88. Problems with Precision/Recall
Can't know the true recall value,
• except in small collections.
Precision and recall are related:
• A combined measure is sometimes more appropriate.
Assumes batch mode:
• Interactive IR is important and has different criteria for successful searches.
Assumes a strict rank ordering matters.
89. Other Measures: R-Precision
R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents for a topic.
• It de-emphasizes the exact ranking of the retrieved relevant documents.
• The average is simply the mean R-Precision for individual topics in the run.
90. Other Measures: F-measure
The F-measure takes into account both precision and recall. It is defined as the harmonic mean of recall and precision:

  F = 2PR / (P + R)
91. Evaluation Problems
Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations.
Evaluating interactive IR requires human subjects (no gold standard or benchmarks).
[Ref.: See Borlund, 2000 & 2003; Borlund & Ingwersen, 1997 for IIR evaluation]
92. Web IR: II
Tutorial on Web IR (Contd.)
93. Web Pages: Basics
Web pages are written in a tagged markup language called the Hypertext Markup Language (HTML).
They contain hyperlinks:
- A hyperlink creates a link to another web page using a uniform resource locator (URL).
- A URL (http://www.icdim.org/program.asp) contains
  a protocol field (http),
  a server hostname (www.icdim.org), and
  a file path (/, the root of the published file system).
Web pages are served through the Internet to client computers using the Hypertext Transfer Protocol (HTTP).
- HTTP is built on top of TCP (Transmission Control Protocol).
Web pages can be viewed using browsers.
94. IR on the Web
Input: The publicly accessible Web
Goal: Retrieve high-quality pages that are relevant to the user's need
– Static (files: text, audio, ...)
– Dynamically generated on request: mostly database access
Two aspects:
1. Processing and representing the collection
   • Gathering the static pages
   • "Learning" about the dynamic pages
2. Processing queries (searching)
95. How Web IR Differs from Classic IR
The Web is:
- huge,
- dynamic,
- self-organized, and
- hyperlinked.
96. How Web IR Differs from Classic IR
1. Pages:
Bulk ........................... >1B (12/99)
Lack of stability .............. Estimates: 23%/day, 38%/week
Heterogeneity
– Type of documents ............ Text, pictures, audio, scripts, ...
– Quality
– Language ..................... 100+
Duplication
– Syntactic .................... 30% (near) duplicates
– Semantic ..................... ??
High linkage ................... ≥ 8 links/page on average
97. The Big Challenge
Meet the user needs given the heterogeneity of Web pages.
98. How Web IR Differs from Classic IR
2. Users
Make poor queries:
– Short (2.35 terms on average)
– Imprecise terms
– Sub-optimal syntax (80% of queries without any operator)
– Low effort
99. How Web IR Differs from Classic IR
Wide variance in
– Needs
– Knowledge
– Bandwidth
Specific behavior:
– 85% look over one result screen only
– 78% of queries are not modified
– Follow links
100. The Big Challenge
Meet the user needs given the heterogeneity of Web pages and the poorly made queries.
101. Web IR: The Bright Side
Many tools available
Personalization
Interactivity (refine the query if needed)
102. Web IR Tools
General-purpose search engines:
– Direct: AltaVista, Excite, Google, Infoseek, Lycos, ...
– Indirect (meta-search): MetaCrawler, AskJeeves, ...
Hierarchical directories:
Yahoo!, The Best of the Web (http://botw.org)
104. Web IR Tools
Specialized search engines:
– Home page finder: Ahoy
– Shopping robots: Jango, Junglee, ...
– School and university: searchedu.com
– Live images from everywhere: EarthCam (http://www.earthcam.com)
– The Search Engines Directory: http://www.searchengineguide.com/searchengines.html
Search-by-example: Alexa's "What's related", Excite's "More like this"
105. Search Engine Components
1. Crawler (spider, robot, or bot) – fetches web pages
2. Indexer – processes and represents the data
3. Search interface – answers queries
106. Crawler: Basic Principles
A crawler collects web pages by scanning collected web pages for hyperlinks to other pages that have not been collected yet.
1. Start from a given set of URLs
2. Fetch and scan them for new URLs (outlinks)
3. Fetch these pages in turn
4. Repeat steps 2-3 for the newly found pages (see the sketch below)
[Ref.: Heydon & Najork, 1999; Brin & Page, 1998]
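A toy version of this loop, using only the Python standard library (my own sketch; a real crawler also needs politeness delays, robots.txt handling, duplicate detection and robust HTML parsing):

# Sketch: a toy breadth-first crawler (no politeness, robots.txt or error handling).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=20):
    frontier, seen, pages = deque(seed_urls), set(seed_urls), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        pages[url] = html
        # Scan the fetched page for outlinks and enqueue unseen ones.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# pages = crawl(["http://example.com/"])   # hypothetical seed URL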
107. Indexing the Web
In traditional IR, documents are static and non-linked.
Web documents are dynamic: documents are added, modified or deleted.
The mappings from terms to documents and positions are not constructed incrementally (Slide 19).
How to update the index?
Web documents are linked: they contain hyperlinks.
How to rank documents?
108. Web Pages Are Dynamic: Some Figures
40% of all webpages in their dataset changed within a week, and 23% of the .com pages changed daily.
[Junghoo Cho and Hector Garcia-Molina, 2000]
109. Updating the Index: A Solution
- A static index is made, which is the main index used for answering queries.
- A signed (d, t) record, written (d, t, s), is maintained for documents being added or deleted (or modified), where s is a bit specifying whether the document has been deleted or inserted.
- A stop-press index is created using the (d, t, s) records.
- A query is sent to both the main index and the stop-press index.
- The main index returns a set of documents (D0).
- The stop-press index returns two sets (D+ and D-):
  D+ is the set of documents not yet indexed, and
  D- is the set of documents matching the query that have been removed from the collection since D0 was constructed.
110. Updating the Index
- The retrieved set is constructed as (D0 ∪ D+) − D−.
- When the stop-press index gets too large, the signed (d, t, s) records are sorted in (d, t, s) order and merge-purged into the master (t, d) records.
- The master index is rebuilt.
- The stop-press index is emptied.
111. Index Compression Techniques
A significant portion of index space is used by document IDs.
Delta encoding is used to save space:
- Sort document IDs in increasing order.
- Store the first ID in full, and only gaps (i.e. the difference from the previous ID) for subsequent entries.
112. Delta Encoding: Example
Suppose the word "harmony" appears in documents 50, 75, 110, 120, 125, 170, 200. The record for harmony is the vector (50, 25, 35, 10, 5, 45, 30).
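A two-line encoder/decoder makes the example concrete (my own sketch):

# Sketch: delta (gap) encoding and decoding of a sorted posting list of document IDs.
from itertools import accumulate

def delta_encode(doc_ids):
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    return list(accumulate(gaps))

postings = [50, 75, 110, 120, 125, 170, 200]
print(delta_encode(postings))                 # [50, 25, 35, 10, 5, 45, 30]
print(delta_decode(delta_encode(postings)))   # [50, 75, 110, 120, 125, 170, 200]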
113. Other Issues
Spamming
- Adding popular terms to pages unrelated to those terms.
Titles, headings, metatags
- Search engines give additional weight to terms occurring in titles, headings, font modifiers and metatags.
Approximate string matching
- Soundex
- n-grams (Google suggests variant spellings of query terms based on the query log using n-grams)
Metasearch engines
- Meta-search engines send the query to several search engines at once and return the results from all of the search engines in one long unified list.
116. Link Analysis
Link analysis is a technique that exploits the additional information inherent in the hyperlink structure of the Web to improve the quality of search results.
Link analysis or ranking algorithms underlie several of today's most popular and successful search engines.
All major search engines combine link analysis scores with more traditional information retrieval scores to retrieve web pages in response to a query.
PageRank and HITS are two algorithms for ranking web pages based on links.
117. Google's Search Behaviour: A Few Points
Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, term order and the proximity of the search terms to one another on the page.
Words in a special typeface (bold, underlined or all capitals) get extra credit.
Google can also match multi-word phrases and sentences.
Google does not disclose everything that goes into its search index, but the cornerstone algorithm, called PageRank, is well known.
For more information on Google's technology, visit www.google.com/technology
118. Google's Search Behaviour
Google returns only pages that match all your search terms.
A search for [cricket match] finds pages containing the words "cricket" and "match".
119. Google's Search Behaviour
Google returns pages that match your search terms exactly.
If you search for "tv", Google won't find "television"; if you search for "cheap", Google won't find "inexpensive".
120. Google's Search Behaviour
Google returns pages that match variants of your search terms.
The query [child lock] finds pages that contain words that are similar to some or all of your search terms, e.g., "child", "children" or "children's", "locks" or "locking".
121. Google's Search Behaviour
Google ignores stop words.
Google favors results that have your search terms near each other (Google considers the proximity of your search terms within a page).
122. Google's Search Behaviour
Google gives higher priority to pages that have the terms in the same order as in your query.
Google is NOT case sensitive; it assumes all search terms are lowercase.
Google ignores some punctuation and special characters, including , . ; ? [ ] ( ) @ / #
[Dr. Tanveer] returns the same results as [Dr Tanveer].
123. Google's Search Behaviour
A term with an apostrophe (single quote) doesn't match the term without an apostrophe.
Google searches for variations on any hyphenated terms.
[e-mail] matches "e-mail", "email", and "e mail".
125. Google's PageRank
PageRank is a system for ranking web pages, developed by the Google founders Larry Page and Sergey Brin at Stanford University.
PageRank determines the quality of a Web page by the pages that link to it.
Once a month, Google's spiders crawl the Web to update and enhance the Google index.
PageRank is a numeric value that represents how important a page is on the web.
126. Google's PageRank
PageRank looks at a Web page and determines how many other pages link to it (a measure of popularity). PageRank then analyzes the links to those pages.
When one page links to another page, it is effectively casting a vote for the other page.
The more votes that are cast for a page, the more important the page must be.
Google calculates a page's importance from the votes cast for it.
127. Google's PageRank
The PageRank of page A, PR(A), is:

  PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where
T1...Tn are the pages linking to page A,
d is a damping factor, usually set to 0.85,
PR(T) is the PageRank of page T,
C(T) is the number of links going out of page T, and
PR(T)/C(T) is the PageRank of page T divided by the number of links going out of that page.
128. Google's PageRank
More simply,
a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
"share" = the linking page's PageRank divided by the number of outgoing links on the page.
A page "votes" an amount of PageRank onto each page that it links to.
129. Google's PageRank
The PageRank of a page that links to yours is important, but the number of links on that page is also important.
The more links there are on a page, the less PageRank value your page will receive from it.
130. Google's PageRank
The PageRank algorithm is applied by first guessing a PageRank for all the pages that have been indexed and then iterating recursively until the PageRank converges.
What PageRank in effect says is that pages "vote" for other pages on the Internet. So if page A links to page B (i.e. page A votes for page B), it is saying that B is an important page.
- If lots of pages link to a page, then it has more votes and its worth should be higher.
131. Google's PageRank: Example
Let's consider a web site with 3 pages (A, B and C), with no links between them, no links coming in from the outside, and an initial PageRank of 1.
After one iteration:
PR(A) = (1 - d) + d * (0) = 0.15
PR(B) = PR(C) = 0.15
After 10 iterations:
PR(A) = PR(B) = PR(C) = 0.15
132. Example
Now, we link page A to page B and run the calculations for each page.
We end up with:
Page A = 0.15
Page B = 1
Page C = 0.15
After the second iteration the figures are:
Page A = 0.15
Page B = 0.2775
Page C = 0.15
133. Example
Now we link all pages to all pages and repeat the calculations with an initial PageRank of 1.
We get:
Page A = 1
Page B = 1
Page C = 1
134. Example
Now remove the links between page B and page C.
After 1 iteration the results are:
PR(A) = 0.15 + 0.85 * (1 + 1) = 1.85
PR(B) = 0.15 + 0.85 * (1/2) = 0.575
PR(C) = 0.575
After the second iteration the results are:
PR(A) = 1.1275
PR(B) = 0.93625
PR(C) = 0.93625
After the third iteration:
PR(A) = 1.741625
PR(B) = 0.629
PR(C) = 0.629
...
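The iterations in these examples can be reproduced with a short script. Below is a sketch of the non-normalized PageRank formula from slide 127 (d = 0.85), applied to the link graph of this last example; it is an illustration, not Google's actual implementation.

# Sketch: iterating PR(p) = 0.15 + 0.85 * sum(PR(q)/C(q)) over pages q linking to p.
def pagerank(links, d=0.85, iterations=3):
    # links: page -> list of pages it links to
    pr = {page: 1.0 for page in links}             # initial PageRank of 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            inbound = (pr[q] / len(links[q]) for q in links if page in links[q])
            new_pr[page] = (1 - d) + d * sum(inbound)
        pr = new_pr                                # update all pages at once, as in the slides
    return pr

# Slide 134: A links to B and C; B and C link only to A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
print(pagerank(links, iterations=1))   # ~ {'A': 1.85, 'B': 0.575, 'C': 0.575}
print(pagerank(links, iterations=3))   # ~ A = 1.74, B = C = 0.63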
136. Semantic Web?
"Semantic" stands for "the meaning of"; the semantics of something is the meaning of something.
The Semantic Web is about describing things in a way that computer applications can understand.
137. Semantic Web
The Semantic Web is not about links between web pages.
It describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, etc.).
138. Semantic Web
Semantic Web = a Web with a meaning.
"If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database."
- Tim Berners-Lee, Weaving the Web, 1999
The Semantic Web uses RDF to describe web resources.
139. RDF
RDF (Resource Description Framework) is a markup language for describing information and resources on the web.
Putting information into RDF files makes it possible for computer programs ("web spiders") to search, discover, pick up, collect, analyze and process information from the web.
140. References
Soumen Chakrabarti, "Mining the Web: Discovering knowledge from hypertext data". Elsevier, 2003.
B. J. Jansen, A. Spink and T. Saracevic, "Real life, real users and real needs: A study and analysis of user queries on the Web". Information Processing & Management, 36(2), 207-227, 2000.
G. Salton, E. A. Fox and H. Wu, "Extended Boolean information retrieval". Communications of the ACM, 26(11), 1022-1036, 1983.
G. Salton and M. J. McGill, "Introduction to modern information retrieval". New York: McGraw-Hill, 1983.
C. J. van Rijsbergen, "Information Retrieval". 2nd ed. Butterworths, London, 1979.
E. A. Fox, S. Betrabet, M. Kaushik and W. Lee, "Extended Boolean model". In W. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms. Prentice Hall, pp. 393-418, 1992.
H. P. Luhn, "The automatic creation of literature abstracts". IBM Journal of Research and Development, 2(2), 1958.
C. P. Paice, "Soft evaluation of Boolean search queries in information retrieval systems". Information Technology: Research and Development, 3(1), 33-42, 1984.
M. F. Porter, "An algorithm for suffix stripping". Program, 14(3), 130-137, 1980.
A. Heydon and M. Najork, "A scalable, extensible web crawler". World Wide Web, 2(4), 219-229, 1999.
S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine". In Proceedings of the 7th World Wide Web Conference, 1998. decweb.ethz.ch/WWW7/1921/com1921.htm
141. References
P. Borlund and P. Ingwersen, "The development of a method for the evaluation of interactive information retrieval systems". Journal of Documentation, 53, 225-250, 1997.
P. Borlund, "Experimental components for the evaluation of interactive information retrieval systems". Journal of Documentation, 56(1), 2000.
P. Borlund, "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems". Information Research, 8(3), 2003.
Anastasios Tombros and C. J. van Rijsbergen, "Query-sensitive similarity measures for information retrieval". Knowledge and Information Systems, 6, 617-642, 2004.
Tanveer J. Siddiqui and Uma Shanker Tiwary, "Integrating relation and keyword matching in information retrieval". In Rajiv Khosla, Robert J. Howlett, Lakhmi C. Jain (Eds.), Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Data mining and soft computing applications-II, Lecture Notes in Computer Science, vol. 3684, pages 64-71, Melbourne, Australia, September 2005. Springer Verlag.
Tanveer J. Siddiqui and U. S. Tiwary, "A hybrid model to improve relevance in document retrieval". Journal of Digital Information Management, 4(1), 73-81, 2006.
142. Resources
Text books: Salton; van Rijsbergen; Sander Dominich; Frakes & Baeza-Yates
Suggested readings:
Robertson, Sparck-Jones, Voorhees
Myaeng, Liddy, Khoo, I. Ounis, Gelbukh, A. F. Smeaton, T. Strzalkowski
B. J. Jansen, A. Spink
Padmini Srinivasan, M. Mitra, A. Singhal
T. Saracevic, D. Harman, P. Borlund (evaluation/relevance)
143. Resources
Information Retrieval
Information Processing & Management
JASIS (Journal of the American Society for Information Science)
TOIS (ACM Transactions on Information Systems)
Information Research (online)
Proceedings of SIGIR/TREC conferences
Journal of Digital Information Management
KAIS (Springer)