3. Topics
– Algorithmic issues in classic information retrieval (IR), e.g. indexing, stemming, etc.
– Issues related to Web IR
4. Information Retrieval
Information retrieval (IR) deals with the organization, storage, retrieval and evaluation of information relevant to a user's query.
A user having an information need formulates a request in the form of a query written in natural language. The retrieval system responds by retrieving documents that seem relevant to the query.
5. Information Retrieval
Traditionally it has been accepted that an information retrieval system does not return the actual information but the documents containing that information. As Lancaster pointed out:
'An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.'
6. A question answering system provides the user with answers to specific questions.
Data retrieval systems retrieve precise data.
An information retrieval system does not search for specific data as in a data retrieval system, nor does it search for direct answers to a question as in a question answering system.
8. Classic IR
Input: Document collection
Goal: Retrieve documents or text with information content that is relevant to the user's information need
Two aspects:
1. Processing the collection
2. Processing queries (searching)
9. Information Retrieval Model
An IR model is a pattern that defines several aspects of the retrieval procedure, for example:
- how the documents and users' queries are represented,
- how the system retrieves relevant documents according to users' queries, and
- how retrieved documents are ranked.
10. IR Model
An IR model consists of
- a model for documents,
- a model for queries, and
- a matching function which compares queries to documents.
11. Classical IR Model
IR models can be classified as:
- Classical models of IR
- Non-classical models of IR
- Alternative models of IR
12. Classical IR Model
Based on mathematical knowledge that is easily recognized and well understood; simple, efficient and easy to implement.
The three classical information retrieval models are:
- Boolean,
- Vector, and
- Probabilistic models.
13. Non-Classical Models of IR
Non-classical information retrieval models are based on principles other than similarity, probability, Boolean operations, etc., on which classical retrieval models are based.
Examples: information logic model, situation theory model and interaction model.
14. Alternative IR Models
Alternative models are enhancements of classical models making use of specific techniques from other fields.
Examples: cluster model, fuzzy model and latent semantic indexing (LSI) models.
15. Information Retrieval Model
The actual text of the document and query is not used in the retrieval process; instead, some representation of it is used.
The document representation is matched with the query representation to perform retrieval.
One frequently used method is to represent a document as a set of index terms or keywords.
16. Indexing
The process of transforming document text into some representation of it is known as indexing.
Different index structures may be used. One data structure commonly used by IR systems is the inverted index.
17. Inverted Index
An inverted index is simply a list of keywords (tokens/terms), with each keyword having pointers to the documents containing that keyword.
- A document is tokenized; a nonempty sequence of characters not including spaces or punctuation can be regarded as a token.
18. Inverted Index
- Each distinct token in the collection may be represented as a distinct integer.
- A document is thus transformed into a sequence of integers.
19. Example: Inverted Index
D1: The weather is cool.
D2: She has a cool mind.
The inverted index can be represented as a table called POSTING:

tid (term)    did  pos
1 (The)       1    1
2 (weather)   1    2
3 (is)        1    3
4 (cool)      1    4
5 (She)       2    1
6 (has)       2    2
7 (a)         2    3
4 (cool)      2    4
8 (mind)      2    5
20. Example: Inverted Index
Without positions:
The      d1
weather  d1
is       d1
cool     d1, d2
She      d2
has      d2
a        d2
mind     d2
With positions:
The      d1/1
weather  d1/2
is       d1/3
cool     d1/4, d2/4
She      d2/1
has      d2/2
a        d2/3
mind     d2/5
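The listings above can be generated with a few lines of code. The following is a minimal sketch (not part of the original tutorial) that builds a positional inverted index for D1 and D2; the tokenizer simply lowercases and splits on non-alphanumeric characters, which is an assumption rather than the slides' exact tokenization.

import re
from collections import defaultdict

def tokenize(text):
    # A nonempty sequence of characters without spaces/punctuation is a token (lowercased here).
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    # term -> list of (doc_id, position) postings
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].append((doc_id, pos))
    return index

docs = {"d1": "The weather is cool.", "d2": "She has a cool mind."}
index = build_inverted_index(docs)
print(index["cool"])   # [('d1', 4), ('d2', 4)]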
21. The computational cost involved in adopting a full-text logical view (i.e. using the full set of words to represent a document) is high.
Hence, some text operations are usually performed to reduce the set of representative keywords.
22. The two most commonly used text operations are:
1. Stop word elimination, and
2. Stemming
Zipf's law
23. Stop word elimination involves removal of grammatical or function words, while stemming reduces distinct words to their common grammatical root.
24. Indexing
Most indexing techniques involve identifying good document descriptors, such as keywords or terms, to describe the information content of the documents.
A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection.
25. A term can be a single word or a multi-word phrase.
Example:
"Design Features of Information Retrieval systems"
can be represented by the set of terms:
{Design, Features, Information, Retrieval, systems}
or by the set of terms:
{Design, Features, Information Retrieval, Information Retrieval systems}
26. Luhn's Early Assumption
Luhn assumed that the frequency of word occurrence in an article gives a meaningful identification of its content.
The discrimination power of index terms is a function of the rank order of their frequency of occurrence.
27. Stop Word Elimination
Stop words are high-frequency words which have little semantic weight and are thus unlikely to help with retrieval.
Such words are commonly used in documents regardless of topic, and have no topical specificity.
29. Stop Word Elimination
Advantage:
Eliminating these words can result in a considerable reduction in the number of index terms without losing any significant information.
Disadvantage:
It can sometimes result in the elimination of terms useful for searching, for instance the stop word A in "Vitamin A".
Some phrases, like "to be or not to be", consist entirely of stop words.
31. Stemming
Stemming normalizes morphological variants.
It removes suffixes from words to reduce them to some root form, e.g. the words compute, computing, computes and computer will all be reduced to the same word stem, comput.
32. Stemming
Porter Stemmer (1980).
Example:
The stemmed representation of
"Design Features of Information Retrieval systems"
will be
{design, featur, inform, retriev, system}
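A quick way to reproduce the example above is with an off-the-shelf Porter stemmer. The sketch below assumes the NLTK library is installed; a hand-rolled suffix stripper would be used the same way.

# Sketch: stemming with an off-the-shelf Porter stemmer (assumes NLTK is available).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
terms = ["Design", "Features", "of", "Information", "Retrieval", "systems"]
print({stemmer.stem(t) for t in terms})
# Expected terms (order may vary): design, featur, of, inform, retriev, system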
33. Stemming
Stemming throws away useful distinctions.
In some cases it may be useful, helping to conflate similar terms and resulting in increased recall; in others it may be harmful, resulting in reduced precision.
34. Zipf's Law
Zipf's law: the frequency of a word multiplied by its rank in a large corpus is approximately constant, i.e.

frequency × rank ≈ constant,  or  frequency ≈ constant / rank
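As an illustration (my own sketch, not from the original slides), the code below counts word frequencies in any plain-text file and prints rank × frequency for the top words; on a large corpus the products stay in the same ballpark. The file name is hypothetical.

# Sketch: checking Zipf's law (rank * frequency ~ constant) on a plain-text corpus.
import re
from collections import Counter

def zipf_table(path, top=10):
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    for rank, (word, freq) in enumerate(Counter(words).most_common(top), start=1):
        print(f"{rank:>4} {word:<15} freq={freq:<8} rank*freq={rank * freq}")

# zipf_table("corpus.txt")   # hypothetical corpus file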
36. Luhn's Assumptions
Luhn (1958) attempted to quantify the discriminating power of terms by associating it with their frequency of occurrence (term frequency) within the document. He postulated that:
- high-frequency words are quite common (function words),
- low-frequency words are rare words, and
- medium-frequency words are useful for indexing.
37. Boolean Model
The oldest of the three classical models.
It is based on Boolean logic and classical set theory.
It represents documents as sets of keywords, usually stored in an inverted file.
38. Boolean Model
Users are required to express their queries as Boolean expressions consisting of keywords connected with Boolean logical operators (AND, OR, NOT).
Retrieval is performed based on whether or not a document contains the query terms.
39. Boolean Model
Given a finite set
  T = {t1, t2, ..., ti, ..., tm}
of index terms, a finite set
  D = {d1, d2, ..., dj, ..., dn}
of documents, and a Boolean expression in normal form representing a query Q as follows:

  Q = ∧ (∨ θi),   θi ∈ {ti, ¬ti}
40. Boolean Model
1. The set Ri of documents that do (or do not) contain the term ti is obtained:

  Ri = { dj | θi ∈ dj },   θi ∈ {ti, ¬ti}

  where ¬ti ∈ dj means ti ∉ dj.

2. Set operations are used to retrieve documents in response to Q.
41. Example: Boolean Model
Let the set of original documents be
D = {D1, D2, D3}
where
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.
42. Example: Boolean Model
Let the set of terms used to represent these documents be:
T = {information, retrieval, query}
Then the set D of documents will be represented as follows:
D = {d1, d2, d3}
where
d1 = {information, retrieval, query}
d2 = {information, query}
d3 = {retrieval, query}
43. Example (Contd.)
Let the query Q be:

  Q = information ∧ retrieval

First, the following sets S1 and S2 of documents are retrieved in response to Q:

  S1 = { dj | information ∈ dj } = {d1, d2}
  S2 = { dj | retrieval ∈ dj }  = {d1, d3}

Then, the following documents are retrieved in response to query Q:

  { dj | dj ∈ S1 ∩ S2 } = {d1}
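This example can be reproduced with plain Python sets; the following is a minimal sketch of Boolean (AND only) retrieval over the term sets d1–d3, not a full Boolean query parser.

# Sketch: Boolean (AND) retrieval over term-set document representations.
docs = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"information", "query"},
    "d3": {"retrieval", "query"},
}

def boolean_and(*terms):
    # Intersect the sets of documents containing each query term.
    result = None
    for term in terms:
        matching = {d for d, terms_in_d in docs.items() if term in terms_in_d}
        result = matching if result is None else result & matching
    return result

print(boolean_and("information", "retrieval"))   # {'d1'}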
44. Boolean Model
The model is not able to retrieve documents that are only partly relevant to the user query.
It is unable to produce a ranked list of documents.
Users seldom compose their queries as the pure Boolean expressions that this model requires.
45. Extended Boolean Model
To overcome the weaknesses of the Boolean model, numerous extensions have been suggested that do provide a ranked list of documents.
Discussion of these extensions is beyond the scope of this tutorial.
Ref.: P-norm model (Salton et al., 1983) and Paice model (Paice, 1984).
46. Vector Space Model
It represents documents and queries as vectors of features representing terms.
Features are assigned numerical values that are usually some function of the frequency of terms.
A ranking algorithm computes the similarity between document and query vectors to yield a retrieval score for each document.
47. Vector Space Model
Given a finite set of n documents:
  D = {d1, d2, ..., dj, ..., dn}
and a finite set of m terms:
  T = {t1, t2, ..., ti, ..., tm}
each document is represented by a column vector of weights as follows:

  (w1j, w2j, w3j, ..., wij, ..., wmj)^t

where wij is the weight of term ti in document dj.
48. Vector Space Model
The document collection as a whole is represented by an m × n term–document matrix:

  | w11  w12  ...  w1j  ...  w1n |
  | w21  w22  ...  w2j  ...  w2n |
  | ...                          |
  | wi1  wi2  ...  wij  ...  win |
  | ...                          |
  | wm1  wm2  ...  wmj  ...  wmn |
49. Example: Vector Space Model
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.
50. Example: Vector Space Model
Let the weights be assigned based on the frequency of the term within the document.
The term–document matrix (rows: information, retrieval, query; columns: D1, D2, D3) is:

  | 2  1  0 |
  | 2  0  1 |
  | 1  1  1 |
51. Vector Space Model
The raw term frequency approach gives too much importance to the absolute values of the various coordinates of each document.
52. Consider two document vectors:
  (2, 2, 1)^t
  (4, 4, 2)^t
The documents look similar except for the difference in magnitude of the term weights.
53. Normalizing Term Weights
To reduce the importance of the length of document vectors, we normalize document vectors.
Normalization changes all vectors to a standard length.
We can convert document vectors to unit length by dividing each dimension by the overall length of the vector.
54. Normalizing the term–document matrix:

  | 2  1  0 |
  | 2  0  1 |
  | 1  1  1 |

Elements of each column are divided by the length of the column vector, sqrt(Σi wij²). We get:

  | 0.67  0.71  0    |
  | 0.67  0     0.71 |
  | 0.33  0.71  0.71 |
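A quick way to verify the normalized matrix above is the following sketch (my own code, standard library only):

# Sketch: column-wise (per-document) length normalization of a term-document matrix.
from math import sqrt

# Rows = terms (information, retrieval, query); columns = D1, D2, D3.
W = [[2, 1, 0],
     [2, 0, 1],
     [1, 1, 1]]

n_terms, n_docs = len(W), len(W[0])
lengths = [sqrt(sum(W[i][j] ** 2 for i in range(n_terms))) for j in range(n_docs)]
W_norm = [[round(W[i][j] / lengths[j], 2) for j in range(n_docs)] for i in range(n_terms)]
for row in W_norm:
    print(row)
# [0.67, 0.71, 0.0]
# [0.67, 0.0, 0.71]
# [0.33, 0.71, 0.71]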
55. Term Weighting
Luhn's postulate can be refined by noting that:
1. The more a document contains a given word, the more that document is about the concept represented by that word.
2. The fewer documents in a collection a term occurs in, the more discriminating that term is.
56. Term Weighting
The first factor simply means that terms occurring more frequently in a document represent its meaning more strongly than those occurring less frequently.
The second factor considers the distribution of a term across the document collection.
57. Term Weighting
A measure that favors terms appearing in fewer documents is required.
The fraction n/ni gives exactly this measure, where
  n is the total number of documents in the collection, and
  ni is the number of documents in which term i occurs.
58. Term Weighting
As the number of documents in any collection is usually large, the log of this measure is usually taken, resulting in the following form of the inverse document frequency (idf) term weight:

  idfi = log(n / ni)
59. Tf-idf Weighting Scheme
tf  - a document-specific statistic
idf - a global statistic that attempts to capture the distribution of a term across the document collection
60. Tf-idf Weighting Scheme
The term frequency (tf) component is a document-specific statistic that measures the importance of a term within the document.
The inverse document frequency (idf) is a global statistic that attempts to capture the distribution of a term across the document collection.
62. Normalizing tf and idf Factors
tf can be normalized by dividing the term frequency by the frequency of the most frequent term in the document.
idf can be normalized by dividing it by the logarithm of the collection size (n).
63. Term Weighting Schemes
A third factor that may affect the weighting function is the document length.
Weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency
2. Collection frequency or inverse document frequency
3. Document length
64. A term weighting scheme can be represented by a triple ABC, where
A - tf component
B - idf component
C - length normalization component
65. Different combinations of options can be used to represent document and query vectors.
The retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc = "nnn", query = "nnn").
66. Options for the Three Weighting Factors
Term frequency within document (A):
  n   tf = tfij                                raw term frequency
  b   tf = 0 or 1                              binary weight
  a   tf = 0.5 + 0.5 * (tfij / max tf in Dj)   augmented term frequency
  l   tf = ln(tfij) + 1.0                      logarithmic term frequency
Inverse document frequency (B):
  n   wt = tf                                  no conversion
  t   wt = tf * idf                            multiply tf with idf
67. Options for the Three Weighting Factors
Document length (C):
  n   wij = wt                                 no conversion
  c   wij = wt / sqrt(sum of (wt squared))     cosine normalization
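Putting the A/B/C options together, here is a minimal sketch (my own illustration, not the tutorial's code) of an "ntc"-style weighting: raw term frequency, tf * idf with idf = log(n/ni), and cosine length normalization.

# Sketch: "ntc"-style term weighting (raw tf, tf*idf, cosine normalization).
from math import log, sqrt
from collections import Counter

def ntc_weights(doc_tokens, all_docs_tokens):
    n = len(all_docs_tokens)
    tf = Counter(doc_tokens)                                   # A = n: raw term frequency
    df = {t: sum(1 for d in all_docs_tokens if t in d) for t in tf}
    wt = {t: tf[t] * log(n / df[t]) for t in tf}               # B = t: tf * idf, idf = log(n/ni)
    norm = sqrt(sum(w * w for w in wt.values())) or 1.0        # C = c: cosine normalization
    return {t: w / norm for t, w in wt.items()}

docs = [["information", "retrieval", "query"],
        ["information", "query"],
        ["retrieval", "query"]]
print(ntc_weights(docs[0], docs))   # note: a term occurring in every document gets idf = 0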
68. Indexing Algorithm
Step 1. Tokenization: extracts individual terms (words) from the document, converts all words to lower case and removes punctuation marks. The output of this stage is a representation of the document as a stream of terms.
Step 2. Stop word elimination: removes words that appear very frequently across the document collection.
Step 3. Stemming: reduces the remaining terms to their linguistic root, to get the index terms.
Step 4. Term weighting: assigns weights to terms according to their importance in the document, in the collection, or some combination of both.
70. Similarity Measures
Inner product:

  sim(dj, qk) = Σ(i=1..m) wij · wik

Cosine similarity:

  sim(dj, qk) = ( Σ(i=1..m) wij · wik ) / ( sqrt(Σ(i=1..m) wij²) · sqrt(Σ(i=1..m) wik²) )
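For illustration, a small sketch of both measures over weight vectors (my own code; the document and query vectors below are illustrative values, with d1 taken from the normalized example matrix):

# Sketch: inner-product and cosine similarity between a document and a query vector.
from math import sqrt

def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine(d, q):
    denom = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return inner_product(d, q) / denom if denom else 0.0

d1 = [0.67, 0.67, 0.33]             # normalized D1 from the earlier example
query = [1, 1, 0]                   # query containing "information" and "retrieval"
print(inner_product(d1, query))     # 1.34
print(round(cosine(d1, query), 2))  # ~0.94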
71. Evaluation of IR Systems
The evaluation of an IR system is the process of assessing how well a system meets the information needs of its users (Voorhees, 2001).
Criteria for evaluation:
- Coverage of the collection
- Time lag
- Presentation format
- User effort
- Precision
- Recall
72. Evaluation of IR Systems
IR evaluation models can be broadly classified as system-driven models and user-centered models.
System-driven models focus on measuring how well the system can rank documents.
User-centered evaluation models attempt to measure the user's satisfaction with the system.
73. Why System Evaluation?
There are many retrieval models/algorithms/systems; which one is the best?
What is the best component for:
• Ranking function (dot-product, cosine, ...)
• Term selection (stop word removal, stemming, ...)
• Term weighting (TF, TF-IDF, ...)
How far down the ranked list will a user need to look to find some/all relevant documents?
74. Evaluation of IR Systems
The traditional goal of IR is to retrieve all and only the relevant documents in response to a query.
"All" is measured by recall: the proportion of relevant documents in the collection which are retrieved.
"Only" is measured by precision: the proportion of retrieved documents which are relevant.
75. Precision vs. Recall
[Diagram: within the set of all docs, the Relevant set and the Retrieved set overlap; their intersection is RelRetrieved.]

  Recall    = |RelRetrieved| / |Relevant|
  Precision = |RelRetrieved| / |Retrieved|
76. Trade-off between Recall and Precision
[Plot of precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). The ideal is the top-right corner: precision 1 at recall 1. One extreme returns only relevant documents but misses many useful ones (high precision, low recall); the other returns most relevant documents but includes lots of junk (high recall, low precision).]
77. Test Collection Approach
The total number of relevant documents in a collection must be known in order for recall to be calculated.
To provide a framework for the evaluation of IR systems, a number of test collections have been developed (Cranfield, TREC, etc.).
These document collections are accompanied by a set of queries and relevance judgments.
78. IR Test Collections
Collection   Number of documents   Number of queries
Cranfield    1,400                 225
CACM         3,204                 64
CISI         1,460                 112
LISA         6,004                 35
TIME         423                   83
ADI          82                    35
MEDLINE      1,033                 30
TREC-1       742,611               100
79. Fixed Recall Levels
One way to evaluate is to look at average precision at fixed recall levels.
• Provides the information needed for precision/recall graphs.
80. Document Cutoff Levels
Another way to evaluate:
• Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100
• Measure precision at each of these levels
• Take the (weighted) average over results
This focuses on how well the system ranks the first k documents.
81. Computing Recall/Precision Points
For a given query, produce the ranked list of retrievals.
Mark each document in the ranked list that is relevant according to the gold standard.
Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
82. Computing Recall/Precision Points: An Example
n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

Let the total number of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
Missing one relevant document: we never reach 100% recall.
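These recall/precision points are easy to compute programmatically. A minimal sketch (my own illustration) that reproduces the numbers above; the sixth relevant document is never retrieved, so a placeholder id stands in for it:

# Sketch: recall/precision at each relevant document in a ranked list.
ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772, 111}   # 6 relevant docs; 111 is a placeholder never retrieved

found = 0
for rank, doc in enumerate(ranked, start=1):
    if doc in relevant:
        found += 1
        recall = found / len(relevant)
        precision = found / rank
        print(f"R={recall:.3f}  P={precision:.3f}")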
83. Interpolating a Recall/Precision Curve
Interpolate a precision value for each standard recall level:
• rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
• r0 = 0.0, r1 = 0.1, ..., r10 = 1.0
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level greater than or equal to the j-th level.
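A short sketch of this interpolation rule (assuming we already have the (recall, precision) points from the previous example):

# Sketch: interpolated precision at the 11 standard recall levels.
points = [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.385)]

def interpolate(points, levels=None):
    levels = levels or [i / 10 for i in range(11)]
    interpolated = []
    for r in levels:
        # Maximum known precision at any recall >= this standard level (0 if none).
        candidates = [p for rec, p in points if rec >= r]
        interpolated.append((r, max(candidates) if candidates else 0.0))
    return interpolated

for r, p in interpolate(points):
    print(f"recall {r:.1f}: precision {p:.3f}")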
86. Average Recall/Precision Curve
Compute average precision at each standard recall level across all queries.
Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
88. Problems with Precision/Recall
Can't know the true recall value,
• except in small collections.
Precision and recall are related:
• A combined measure is sometimes more appropriate.
Assumes batch mode:
• Interactive IR is important and has different criteria for successful searches.
Assumes a strict rank ordering matters.
89. Other Measures: R-Precision
R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents for a topic.
• It de-emphasizes the exact ranking of the retrieved relevant documents.
• The average is simply the mean R-Precision for individual topics in the run.
90. Other Measures: F-measure
The F-measure takes into account both precision and recall. It is defined as the harmonic mean of recall and precision:

  F = 2PR / (P + R)
91. Evaluation Problems
Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations.
Evaluating interactive IR requires human subjects (no gold standard or benchmarks).
[Ref.: See Borlund, 2000 & 2003; Borlund & Ingwersen, 1997 for IIR evaluation]
92. Web IR: II
Tutorial on Web IR (Contd.)
93. Web Pages: Basics
Web pages are written in a tagged markup language called the Hypertext Markup Language (HTML).
They contain hyperlinks:
- A hyperlink creates a link to another web page using a uniform resource locator (URL).
- A URL (http://www.icdim.org/program.asp) contains
  a protocol field (http),
  a server hostname (www.icdim.org), and
  a file path (/, the root of the published file system).
Web pages are served through the Internet to client computers using the Hypertext Transfer Protocol (HTTP).
- HTTP is built on top of TCP (Transmission Control Protocol).
Web pages can be viewed using browsers.
94. IR on the Web
Input: The publicly accessible Web
Goal: Retrieve high-quality pages that are relevant to the user's need
– Static (files: text, audio, ...)
– Dynamically generated on request: mostly database access
Two aspects:
1. Processing and representing the collection
   • Gathering the static pages
   • "Learning" about the dynamic pages
2. Processing queries (searching)
95. How Web IR Differs from Classic IR
The Web is:
- huge,
- dynamic,
- self-organized, and
- hyperlinked.
96. How Web IR Differs from Classic IR
1. Pages:
Bulk ........................... >1B (12/99)
Lack of stability .............. Estimates: 23%/day, 38%/week
Heterogeneity
– Type of documents ............ Text, pictures, audio, scripts, ...
– Quality
– Language ..................... 100+
Duplication
– Syntactic .................... 30% (near) duplicates
– Semantic ..................... ??
High linkage ................... ≥ 8 links/page on average
97. The Big Challenge
Meet the user needs given the heterogeneity of Web pages.
98. How Web IR Differs from Classic IR
2. Users
Make poor queries:
– Short (2.35 terms on average)
– Imprecise terms
– Sub-optimal syntax (80% of queries without any operator)
– Low effort
99. How Web IR Differs from Classic IR
Wide variance in
– Needs
– Knowledge
– Bandwidth
Specific behavior:
– 85% look over one result screen only
– 78% of queries are not modified
– Follow links
100. The Big Challenge
Meet the user needs given the heterogeneity of Web pages and the poorly made queries.
101. Web IR: The Bright Side
Many tools available
Personalization
Interactivity (refine the query if needed)
102. Web IR Tools
General-purpose search engines:
– Direct: AltaVista, Excite, Google, Infoseek, Lycos, ...
– Indirect (meta-search): MetaCrawler, AskJeeves, ...
Hierarchical directories:
Yahoo!, The Best of the Web (http://botw.org)
104. Web IR Tools
Specialized search engines:
– Home page finder: Ahoy
– Shopping robots: Jango, Junglee, ...
– School and university: searchedu.com
– Live images from everywhere: EarthCam (http://www.earthcam.com)
– The Search Engines Directory: http://www.searchengineguide.com/searchengines.html
Search-by-example: Alexa's "What's related", Excite's "More like this"
105. Search Engine Components
1. Crawler (spider, robot, or bot) – fetches web pages
2. Indexer – processes and represents the data
3. Search interface – answers queries
106. Crawler: Basic Principles
A crawler collects web pages by scanning collected web pages for hyperlinks to other pages that have not been collected yet.
1. Start from a given set of URLs
2. Fetch and scan them for new URLs (outlinks)
3. Fetch these pages in turn
4. Repeat steps 2-3 for the newly found pages (see the sketch below)
[Ref.: Heydon & Najork, 1999; Brin & Page, 1998]
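A toy version of this loop, using only the Python standard library (my own sketch; a real crawler also needs politeness delays, robots.txt handling, duplicate detection and robust HTML parsing):

# Sketch: a toy breadth-first crawler (no politeness, robots.txt or error handling).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=20):
    frontier, seen, pages = deque(seed_urls), set(seed_urls), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        pages[url] = html
        # Scan the fetched page for outlinks and enqueue unseen ones.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# pages = crawl(["http://example.com/"])   # hypothetical seed URL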
107. Indexing the Web
In traditional IR, documents are static and non-linked.
Web documents are dynamic: documents are added, modified or deleted.
The mappings from terms to documents and positions are not constructed incrementally (Slide 19).
How to update the index?
Web documents are linked: they contain hyperlinks.
How to rank documents?
108. Web Pages Are Dynamic: Some Figures
40% of all webpages in their dataset changed within a week, and 23% of the .com pages changed daily.
[Junghoo Cho and Hector Garcia-Molina, 2000]
109. Updating the Index: A Solution
- A static index is made, which is the main index used for answering queries.
- A signed (d, t) record, written (d, t, s), is maintained for documents being added or deleted (or modified), where s is a bit specifying whether the document has been deleted or inserted.
- A stop-press index is created using the (d, t, s) records.
- A query is sent to both the main index and the stop-press index.
- The main index returns a set of documents (D0).
- The stop-press index returns two sets (D+ and D-):
  D+ is the set of documents not yet indexed, and
  D- is the set of documents matching the query that have been removed from the collection since D0 was constructed.
110. Updating the Index
- The retrieved set is constructed as (D0 ∪ D+) − D−.
- When the stop-press index gets too large, the signed (d, t, s) records are sorted in (d, t, s) order and merge-purged into the master (t, d) records.
- The master index is rebuilt.
- The stop-press index is emptied.
111. Index Compression Techniques
A significant portion of index space is used by document IDs.
Delta encoding is used to save space:
- Sort document IDs in increasing order.
- Store the first ID in full, and only gaps (i.e. the difference from the previous ID) for subsequent entries.
112. Delta Encoding: Example
Suppose the word "harmony" appears in documents 50, 75, 110, 120, 125, 170, 200. The record for harmony is the vector (50, 25, 35, 10, 5, 45, 30).
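A two-line encoder/decoder makes the example concrete (my own sketch):

# Sketch: delta (gap) encoding and decoding of a sorted posting list of document IDs.
from itertools import accumulate

def delta_encode(doc_ids):
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    return list(accumulate(gaps))

postings = [50, 75, 110, 120, 125, 170, 200]
print(delta_encode(postings))                 # [50, 25, 35, 10, 5, 45, 30]
print(delta_decode(delta_encode(postings)))   # [50, 75, 110, 120, 125, 170, 200]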
113. Other Issues
Spamming
- Adding popular terms to pages unrelated to those terms.
Titles, headings, metatags
- Search engines give additional weight to terms occurring in titles, headings, font modifiers and metatags.
Approximate string matching
- Soundex
- n-grams (Google suggests variant spellings of query terms based on the query log using n-grams)
Metasearch engines
- Meta-search engines send the query to several search engines at once and return the results from all of the search engines in one long unified list.
116. Link Analysis
Link analysis is a technique that exploits the additional information inherent in the hyperlink structure of the Web to improve the quality of search results.
Link analysis or ranking algorithms underlie several of today's most popular and successful search engines.
All major search engines combine link analysis scores with more traditional information retrieval scores to retrieve web pages in response to a query.
PageRank and HITS are two algorithms for ranking web pages based on links.
117. Google's Search Behaviour: A Few Points
Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, term order and the proximity of the search terms to one another on the page.
Words in a special typeface (bold, underlined or all capitals) get extra credit.
Google can also match multi-word phrases and sentences.
Google does not disclose everything that goes into its search index, but the cornerstone algorithm, called PageRank, is well known.
For more information on Google's technology, visit www.google.com/technology
118. Google's Search Behaviour
Google returns only pages that match all your search terms.
A search for [cricket match] finds pages containing the words "cricket" and "match".
119. Google's Search Behaviour
Google returns pages that match your search terms exactly.
If you search for "tv", Google won't find "television"; if you search for "cheap", Google won't find "inexpensive".
120. Google's Search Behaviour
Google returns pages that match variants of your search terms.
The query [child lock] finds pages that contain words that are similar to some or all of your search terms, e.g., "child", "children" or "children's", "locks" or "locking".
121. Google's Search Behaviour
Google ignores stop words.
Google favors results that have your search terms near each other (Google considers the proximity of your search terms within a page).
122. Google's Search Behaviour
Google gives higher priority to pages that have the terms in the same order as in your query.
Google is NOT case sensitive; it assumes all search terms are lowercase.
Google ignores some punctuation and special characters, including , . ; ? [ ] ( ) @ / #
[Dr. Tanveer] returns the same results as [Dr Tanveer].
123. Google's Search Behaviour
A term with an apostrophe (single quote) doesn't match the term without an apostrophe.
Google searches for variations on any hyphenated terms.
[e-mail] matches "e-mail", "email", and "e mail".
125. Google's PageRank
PageRank is a system for ranking web pages, developed by the Google founders Larry Page and Sergey Brin at Stanford University.
PageRank determines the quality of a Web page by the pages that link to it.
Once a month, Google's spiders crawl the Web to update and enhance the Google index.
PageRank is a numeric value that represents how important a page is on the web.
126. Google's PageRank
PageRank looks at a Web page and determines how many other pages link to it (a measure of popularity). PageRank then analyzes the links to those pages.
When one page links to another page, it is effectively casting a vote for the other page.
The more votes that are cast for a page, the more important the page must be.
Google calculates a page's importance from the votes cast for it.
127. Google's PageRank
The PageRank of page A, PR(A), is:

  PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where
T1...Tn are the pages linking to page A,
d is a damping factor, usually set to 0.85,
PR(T) is the PageRank of page T,
C(T) is the number of links going out of page T, and
PR(T)/C(T) is the PageRank of page T divided by the number of links going out of that page.
128. Google's PageRank
More simply,
a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
"share" = the linking page's PageRank divided by the number of outgoing links on the page.
A page "votes" an amount of PageRank onto each page that it links to.
129. Google's PageRank
The PageRank of a page that links to yours is important, but the number of links on that page is also important.
The more links there are on a page, the less PageRank value your page will receive from it.
130. Google's PageRank
The PageRank algorithm is applied by first guessing a PageRank for all the pages that have been indexed and then iterating recursively until the PageRank converges.
What PageRank in effect says is that pages "vote" for other pages on the Internet. So if page A links to page B (i.e. page A votes for page B), it is saying that B is an important page.
- If lots of pages link to a page, then it has more votes and its worth should be higher.
131. Google's PageRank: Example
Let's consider a web site with 3 pages (A, B and C), with no links between them, no links coming in from the outside, and an initial PageRank of 1.
After one iteration:
PR(A) = (1 - d) + d * (0) = 0.15
PR(B) = PR(C) = 0.15
After 10 iterations:
PR(A) = PR(B) = PR(C) = 0.15
132. Example
Now, we link page A to page B and run the calculations for each page.
We end up with:
Page A = 0.15
Page B = 1
Page C = 0.15
After the second iteration the figures are:
Page A = 0.15
Page B = 0.2775
Page C = 0.15
133. Example
Now we link all pages to all pages and repeat the calculations with an initial PageRank of 1.
We get:
Page A = 1
Page B = 1
Page C = 1
134. Example
Now remove the links between page B and page C.
After 1 iteration the results are:
PR(A) = 0.15 + 0.85 * (1 + 1) = 1.85
PR(B) = 0.15 + 0.85 * (1/2) = 0.575
PR(C) = 0.575
After the second iteration the results are:
PR(A) = 1.1275
PR(B) = 0.93625
PR(C) = 0.93625
After the third iteration:
PR(A) = 1.741625
PR(B) = 0.629
PR(C) = 0.629
...
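The iterations in these examples can be reproduced with a short script. Below is a sketch of the non-normalized PageRank formula from slide 127 (d = 0.85), applied to the link graph of this last example; it is an illustration, not Google's actual implementation.

# Sketch: iterating PR(p) = 0.15 + 0.85 * sum(PR(q)/C(q)) over pages q linking to p.
def pagerank(links, d=0.85, iterations=3):
    # links: page -> list of pages it links to
    pr = {page: 1.0 for page in links}             # initial PageRank of 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            inbound = (pr[q] / len(links[q]) for q in links if page in links[q])
            new_pr[page] = (1 - d) + d * sum(inbound)
        pr = new_pr                                # update all pages at once, as in the slides
    return pr

# Slide 134: A links to B and C; B and C link only to A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
print(pagerank(links, iterations=1))   # ~ {'A': 1.85, 'B': 0.575, 'C': 0.575}
print(pagerank(links, iterations=3))   # ~ A = 1.74, B = C = 0.63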
136. Semantic Web?
"Semantic" stands for "the meaning of"; the semantics of something is the meaning of something.
The Semantic Web is about describing things in a way that computer applications can understand.
137. Semantic Web
The Semantic Web is not about links between web pages.
It describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, etc.).
138. Semantic Web
Semantic Web = a Web with a meaning.
"If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database."
- Tim Berners-Lee, Weaving the Web, 1999
The Semantic Web uses RDF to describe web resources.
139. RDF
RDF (Resource Description Framework) is a markup language for describing information and resources on the web.
Putting information into RDF files makes it possible for computer programs ("web spiders") to search, discover, pick up, collect, analyze and process information from the web.
140. References
Soumen Chakrabarti, "Mining the Web: Discovering knowledge from hypertext data". Elsevier, 2003.
B. J. Jansen, A. Spink and T. Saracevic, "Real life, real users and real needs: A study and analysis of user queries on the Web". Information Processing & Management, 36(2), 207-227, 2000.
G. Salton, E. A. Fox and H. Wu, "Extended Boolean information retrieval". Communications of the ACM, 26(11), 1022-1036, 1983.
G. Salton and M. J. McGill, "Introduction to modern information retrieval". New York: McGraw-Hill, 1983.
C. J. van Rijsbergen, "Information Retrieval". 2nd ed. Butterworths, London, 1979.
E. A. Fox, S. Betrabet, M. Kaushik and W. Lee, "Extended Boolean model". In W. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms. Prentice Hall, pp. 393-418, 1992.
H. P. Luhn, "The automatic creation of literature abstracts". IBM Journal of Research and Development, 2(2), 1958.
C. P. Paice, "Soft evaluation of Boolean search queries in information retrieval systems". Information Technology: Research and Development, 3(1), 33-42, 1984.
M. F. Porter, "An algorithm for suffix stripping". Program, 14(3), 130-137, 1980.
A. Heydon and M. Najork, "A scalable, extensible web crawler". World Wide Web, 2(4), 219-229, 1999.
S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine". In Proceedings of the 7th World Wide Web Conference, 1998. decweb.ethz.ch/WWW7/1921/com1921.htm
141. References
P. Borlund and P. Ingwersen, "The development of a method for the evaluation of interactive information retrieval systems". Journal of Documentation, 53, 225-250, 1997.
P. Borlund, "Experimental components for the evaluation of interactive information retrieval systems". Journal of Documentation, 56(1), 2000.
P. Borlund, "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems". Information Research, 8(3), 2003.
Anastasios Tombros and C. J. van Rijsbergen, "Query-sensitive similarity measures for information retrieval". Knowledge and Information Systems, 6, 617-642, 2004.
Tanveer J. Siddiqui and Uma Shanker Tiwary, "Integrating relation and keyword matching in information retrieval". In Rajiv Khosla, Robert J. Howlett, Lakhmi C. Jain (Eds.), Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Data mining and soft computing applications-II, Lecture Notes in Computer Science, vol. 3684, pages 64-71, Melbourne, Australia, September 2005. Springer Verlag.
Tanveer J. Siddiqui and U. S. Tiwary, "A hybrid model to improve relevance in document retrieval". Journal of Digital Information Management, 4(1), 73-81, 2006.
142. Resources
Text books: Salton; van Rijsbergen; Sander Dominich; Frakes & Baeza-Yates
Suggested readings:
Robertson, Sparck-Jones, Voorhees
Myaeng, Liddy, Khoo, I. Ounis, Gelbukh, A. F. Smeaton, T. Strzalkowski
B. J. Jansen, A. Spink
Padmini Srinivasan, M. Mitra, A. Singhal
T. Saracevic, D. Harman, P. Borlund (evaluation/relevance)
143. Resources
Information Retrieval
Information Processing & Management
JASIS (Journal of the American Society for Information Science)
TOIS (ACM Transactions on Information Systems)
Information Research (online)
Proceedings of SIGIR/TREC conferences
Journal of Digital Information Management
KAIS (Springer)