Web Information Retrieval
Tanveer J Siddiqui
J K Institute of Applied Physics & Technology
University of Allahabad
Example of a query
[Screenshot: example query on a search engine (Engine 1)]
Topics
– Algorithmic issues in classic information retrieval (IR), e.g. indexing, stemming, etc.
– Issues related to Web IR
Information Retrieval
• Information retrieval (IR) deals with the organization, storage, retrieval and evaluation of information relevant to a user's query.
• A user having an information need formulates a request in the form of a query written in natural language. The retrieval system responds by retrieving documents that seem relevant to the query.
Information Retrieval
• Traditionally it has been accepted that an information retrieval system does not return the actual information but the documents containing that information. As Lancaster pointed out:
'An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.'
• A question answering system provides the user with answers to specific questions.
• Data retrieval systems retrieve precise data.
• An information retrieval system does not search for specific data as in a data retrieval system, nor search for direct answers to a question as in a question answering system.
Information Retrieval Process
[Figure: overview of the information retrieval process]
Classic IR
• Input: Document collection
• Goal: Retrieve documents or text with information content that is relevant to the user's information need
• Two aspects:
1. Processing the collection
2. Processing queries (searching)
Information Retrieval Model
An IR model is a pattern that defines several aspects of the retrieval procedure, for example:
• how the documents and user's queries are represented
• how the system retrieves relevant documents according to users' queries, and
• how retrieved documents are ranked.
IR Model
• An IR model consists of
- a model for documents
- a model for queries, and
- a matching function which compares queries to documents.
Classical IR Model
IR models can be classified as:
• Classical models of IR
• Non-classical models of IR
• Alternative models of IR
Classical IR Model
• based on mathematical knowledge that was easily recognized and well understood
• simple, efficient and easy to implement
• The three classical information retrieval models are:
- Boolean
- Vector and
- Probabilistic models
Non-Classical models of IR
Non-classical information retrieval models are based on principles other than similarity, probability, Boolean operations etc., on which classical retrieval models are based.
Examples: information logic model, situation theory model and interaction model.
Alternative IR models
• Alternative models are enhancements of classical models making use of specific techniques from other fields.
Example:
Cluster model, fuzzy model and latent semantic indexing (LSI) models.
Information Retrieval Model
• The actual text of the document and query is not used in the retrieval process. Instead, some representation of it is used.
• The document representation is matched with the query representation to perform retrieval.
• One frequently used method is to represent a document as a set of index terms or keywords.
Indexing
• The process of transforming document text to some representation of it is known as indexing.
• Different index structures might be used. One data structure commonly used by IR systems is the inverted index.
Inverted index
• An inverted index is simply a list of keywords (tokens/terms), with each keyword having pointers to the documents containing that keyword.
- A document is tokenized. A nonempty sequence of characters not including spaces and punctuation can be regarded as a token.
Inverted index
- Each distinct token in the collection may be represented as a distinct integer.
- A document is thus transformed into a sequence of integers.
Example: Inverted index
D1 → The weather is cool.
D2 → She has a cool mind.
The inverted index can be represented as a table called POSTING:

tid           did  pos
1 (The)        1    1
2 (Weather)    1    2
3 (is)         1    3
4 (cool)       1    4
5 (She)        2    1
6 (has)        2    2
7 (a)          2    3
4 (cool)       2    4
8 (mind)       2    5
The  d1
weather  d1
Example : Inverted index
The  d1/1
weather  d1/2
is  d1/3
20
weather  d1
is  d1
cool  d1,d2
She  d2
has  d2
a  d2
mind  d2
is  d1/3
cool  d1/4,d2/4
She  d2/1
has  d2/2
a  d2/3
mind  d2/5
• The computational cost involved in adopting a full-text logical view (i.e. using the full set of words to represent a document) is high.
Hence, some text operations are usually performed to reduce the set of representative keywords.
• The two most commonly used text operations are:
1. Stop word elimination and
2. Stemming
• Zipf's law
• Stop word elimination involves removal of grammatical or function words, while stemming reduces distinct words to their common grammatical root.
Indexing
• Most indexing techniques involve identifying good document descriptors, such as keywords or terms, to describe the information content of the documents.
• A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection.
A term can be a single word or a multi-word phrase.
Example:
Design Features of Information Retrieval systems
can be represented by the set of terms:
Design, Features, Information, Retrieval, systems
or by the set of terms:
Design, Features, Information Retrieval, Information Retrieval systems
Luhn’s early Assumption
• Luhn assumed that the frequency of word occurrence in an article gives a meaningful identification of its content.
• The discrimination power of index terms is a function of the rank order of their frequency of occurrence.
Stop Word Elimination
• Stop words are high-frequency words which have little semantic weight and are thus unlikely to help with retrieval.
• Such words are commonly used in documents regardless of topic, and have no topical specificity.
Example:
articles ("a", "an", "the") and prepositions (e.g. "in", "of", "for", "at", etc.).
Stop Word Elimination
• Advantage
Eliminating these words can result in a considerable reduction in the number of index terms without losing any significant information.
• Disadvantage
It can sometimes result in the elimination of terms useful for searching, for instance the stop word A in Vitamin A.
Some phrases, like "to be or not to be", consist entirely of stop words.
Stop Words
• About, above, accordingly, afterwards, again, against, alone, along, already, am, among, amongst, and, another, any, anyone, anything, anywhere, around, as, aside, awfully, be, because
Stemming
• Stemming normalizes morphological variants.
• It removes suffixes from words to reduce them to some root form, e.g. the words compute, computing, computes and computer will all be reduced to the same word stem comput.
Stemming
• Porter Stemmer (1980).
• Example:
The stemmed representation of
Design Features of Information Retrieval systems
will be
{design, featur, inform, retriev, system}
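The same stemmed representation can be reproduced with an off-the-shelf Porter stemmer; the sketch below assumes the NLTK package, which is my choice and not something the slides prescribe.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
phrase = "Design Features of Information Retrieval systems"
stems = [stemmer.stem(w) for w in phrase.lower().split()]
print(stems)
# roughly ['design', 'featur', 'of', 'inform', 'retriev', 'system'];
# removing the stop word 'of' gives the set shown on the slide
```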
Stemming
Stemming throws away useful distinctions. In some cases it may help conflate similar terms, resulting in increased recall; in others it may be harmful, resulting in reduced precision.
Zipf’s law
Zipf's law: the frequency of a word multiplied by its rank in a large corpus is approximately constant, i.e.

   frequency × rank ≈ constant

[Figure: relationship between the frequency of words and their rank order]
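A quick way to check this on any plain-text corpus is sketched below (my own code; the corpus file name is a placeholder):

```python
import re
from collections import Counter

def zipf_table(text, top=10):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words).most_common(top)
    # for a large corpus, rank * frequency should stay roughly constant
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

with open("corpus.txt", encoding="utf-8") as f:   # placeholder file name
    for row in zipf_table(f.read()):
        print(row)
```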
Luhn’s assumptions
• Luhn (1958) attempted to quantify the discriminating power of terms by associating it with their frequency of occurrence (term frequency) within the document. He postulated that:
- high-frequency words are quite common (function words)
- low-frequency words are rare words
- medium-frequency words are useful for indexing
Boolean model
• the oldest of the three classical models.
• is based on Boolean logic and classical set theory.
• represents documents as a set of keywords, usually stored in an inverted file.
Boolean model
• Users are required to express their queries as a Boolean expression consisting of keywords connected with Boolean logical operators (AND, OR, NOT).
• Retrieval is performed based on whether or not a document contains the query terms.
Boolean model
Given a finite set
   T = {t1, t2, ..., ti, ..., tm}
of index terms, a finite set
   D = {d1, d2, ..., dj, ..., dn}
of documents, and a Boolean expression in a normal form representing a query Q as follows:

   Q = ∧ (∨ θi),   θi ∈ {ti, ¬ti}
Boolean model
1. The sets Ri of documents are obtained that contain or do not contain term ti:

   Ri = { dj | θi ∈ dj },   where θi ∈ {ti, ¬ti} and ¬ti ∈ dj means ti ∉ dj

2. Set operations are used to retrieve documents in response to Q.
Example: Boolean model
Let the set of original documents be
D = {D1, D2, D3}
Where
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.
Example: Boolean model
Let the set of terms used to represent these documents be:
T = {information, retrieval, query}
Then the set D of documents will be represented as follows:
D = {d1, d2, d3}
where
d1 = {information, retrieval, query}
d2 = {information, query}
d3 = {retrieval, query}
Example (Contd.)
Let the query Q be:
   Q = information ∧ retrieval
First, the following sets S1 and S2 of documents are retrieved in response to Q:
   S1 = { dj | information ∈ dj } = {d1, d2}
   S2 = { dj | retrieval ∈ dj } = {d1, d3}
Then, the following documents are retrieved in response to query Q:
   { dj | dj ∈ S1 ∩ S2 } = {d1}
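The same retrieval can be expressed directly with set operations; the sketch below is mine and simply mirrors the worked example.

```python
# document representations as sets of index terms (from the example above)
docs = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"information", "query"},
    "d3": {"retrieval", "query"},
}

def containing(term):
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Q = information AND retrieval
s1 = containing("information")   # {'d1', 'd2'}
s2 = containing("retrieval")     # {'d1', 'd3'}
print(s1 & s2)                   # {'d1'}
```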
Boolean model
• the model is not able to retrieve documents partly relevant to a user query
• unable to produce a ranked list of documents
• users seldom compose their queries with the pure Boolean expressions that this model requires
Extended Boolean model
• To overcome the weaknesses of the Boolean model, numerous extensions have been suggested that do provide a ranked list of documents.
• Discussion of these extensions is beyond the scope of this tutorial.
Ref. P-norm model (Salton et al., 1983) and Paice model (Paice, 1984).
Vector Space Model
• It represents documents and queries as vectors of features representing terms.
• Features are assigned some numerical value that is usually some function of the frequency of terms.
• A ranking algorithm computes the similarity between document and query vectors to yield a retrieval score for each document.
Vector Space Model
• Given a finite set of n documents:
   D = {d1, d2, ..., dj, ..., dn}
and a finite set of m terms:
   T = {t1, t2, ..., ti, ..., tm}
• Each document will be represented by a column vector of weights as follows:
   (w1j, w2j, w3j, ..., wij, ..., wmj)t
where wij is the weight of term ti in document dj.
Vector Space Model
The document collection as a whole will be represented by an m × n term-document matrix:

   | w11  w12  ...  w1j  ...  w1n |
   | w21  w22  ...  w2j  ...  w2n |
   | ...                          |
   | wi1  wi2  ...  wij  ...  win |
   | ...                          |
   | wm1  wm2  ...  wmj  ...  wmn |
Example: Vector Space Model
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.
Example: Vector Space Model
Let the weights be assigned based on the frequency of the term within the document.
The term-document matrix is:

   | 2  1  0 |
   | 2  0  1 |
   | 1  1  1 |

(rows: information, retrieval, query; columns: D1, D2, D3)
Vector Space Model
• The raw term frequency approach gives too much importance to the absolute values of the various coordinates of each document.
Consider two document vectors:
   (2, 2, 1)t
   (4, 4, 2)t
The documents look similar except for the difference in magnitude of the term weights.
Normalizing term weights
• To reduce the importance of the length of document vectors, we normalize document vectors.
• Normalization changes all the vectors to a standard length.
We can convert document vectors to unit length by dividing each dimension by the overall length of the vector.
Normalizing the term-document matrix:

   | 2  1  0 |
   | 2  0  1 |
   | 1  1  1 |

Elements of each column are divided by the length of the column vector, √(Σi wij²). We get:

   | 0.67  0.71  0    |
   | 0.67  0     0.71 |
   | 0.33  0.71  0.71 |
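A small sketch (plain Python, my own code) that reproduces the normalization above:

```python
import math

# term-document matrix from the example (rows: information, retrieval, query)
M = [
    [2, 1, 0],
    [2, 0, 1],
    [1, 1, 1],
]

def normalize_columns(matrix):
    rows, cols = len(matrix), len(matrix[0])
    lengths = [math.sqrt(sum(matrix[i][j] ** 2 for i in range(rows)))
               for j in range(cols)]
    return [[matrix[i][j] / lengths[j] for j in range(cols)] for i in range(rows)]

for row in normalize_columns(M):
    print([round(x, 2) for x in row])
# [0.67, 0.71, 0.0]
# [0.67, 0.0, 0.71]
# [0.33, 0.71, 0.71]
```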
Term weighting
Luhn's postulate can be refined by noting that:
1. The more a document contains a given word, the more that document is about the concept represented by that word.
2. The fewer documents in the collection a term occurs in, the more discriminating that term is.
Term weighting
• The first factor simply means that terms that occur more frequently in a document represent its meaning more strongly than those occurring less frequently.
• The second factor considers term distribution across the document collection.
Term weighting
• A measure that favors terms appearing in fewer documents is required.
The fraction n/ni gives exactly this measure, where
n is the total number of documents in the collection
and ni is the number of documents in which term i occurs.
Term weighting
• As the number of documents in any collection is usually large, the log of this measure is usually taken, resulting in the following form of the inverse document frequency (idf) term weight:

   idfi = log(n / ni)
Tf-idf weighting scheme
tf  - document-specific statistic
idf - global statistic that attempts to include the distribution of the term across the document collection.
Tf-idf weighting scheme
• The term frequency (tf) component is a document-specific statistic that measures the importance of a term within the document.
• The inverse document frequency (idf) is a global statistic that attempts to include the distribution of the term across the document collection.
Tf-idf weighting scheme
Example: Computing tf-idf weight
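As an illustrative sketch of my own (using raw term frequency and a natural-log idf, as described above, on the earlier three-document example):

```python
import math

# term frequencies per document (from the earlier vector space example)
tf = {
    "d1": {"information": 2, "retrieval": 2, "query": 1},
    "d2": {"information": 1, "query": 1},
    "d3": {"retrieval": 1, "query": 1},
}

n = len(tf)           # total number of documents
df = {}               # ni: number of documents containing term i
for terms in tf.values():
    for term in terms:
        df[term] = df.get(term, 0) + 1

def tf_idf(doc_id, term):
    return tf[doc_id].get(term, 0) * math.log(n / df[term])

print(round(tf_idf("d1", "retrieval"), 3))   # 2 * ln(3/2) ≈ 0.811
print(round(tf_idf("d1", "query"), 3))       # 1 * ln(3/3) = 0.0
```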
Normalizing tf and idf factors
• tf can be normalized by dividing the term frequency by the frequency of the most frequent term in the document.
• idf can be normalized by dividing it by the logarithm of the collection size (n).
Term weighting schemes
• A third factor that may affect the weighting function is the document length.
• The weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency
2. Collection frequency or inverse document frequency
3. Document length
• A term weighting scheme can be represented by a triple ABC:
A - tf component
B - idf component
C - length normalization component.
• Different combinations of options can be used to represent document and query vectors.
• The retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc = "nnn", query = "nnn").
Options for the three weighting factors
• Term frequency within document (A)
   n   Raw term frequency           tf = tfij
   b   Binary weight                tf = 0 or 1
   a   Augmented term frequency     tf = 0.5 + 0.5 × (tfij / max tf in Dj)
   l   Logarithmic term frequency   tf = ln(tfij) + 1.0
• Inverse document frequency (B)
   n   No conversion                wt = tf
   t   Multiply tf with idf         wt = tf × idf
Options for the three weighting factors
• Document length (C)
   n   No conversion                wij = wt
   c   wij is obtained by dividing each wt by √(Σ wt²)
Indexing Algorithm
Step 1. Tokenization: extracts individual terms (words) in the document, converts all words to lower case and removes punctuation marks. The output of the first stage is a representation of the document as a stream of terms.
Step 2. Stop word elimination: removes words that appear very frequently across the document collection (stop words).
Step 3. Stemming: reduces remaining terms to their linguistic root, to get index terms.
Step 4. Term weighting: assigns weights to terms according to their importance in the document, in the collection, or some combination of both.
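A compact sketch of the four steps (my own code; it assumes NLTK's Porter stemmer and a small illustrative stop list rather than a real one):

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "is", "of", "for", "in", "to", "and", "with"}  # illustrative
stemmer = PorterStemmer()

def index_document(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())        # Step 1: tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # Step 2: stop word elimination
    stems = [stemmer.stem(t) for t in tokens]              # Step 3: stemming
    return Counter(stems)                                  # Step 4: raw tf weights

print(index_document("Information retrieval is concerned with the evaluation of information."))
# Counter({'inform': 2, 'retriev': 1, 'concern': 1, 'evalu': 1})
```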
Example: Document Representation
Stemmed terms   Document 1   Document 2   Document 3
inform               0            0            1
intellig             0            0            1
model                1            1            0
probabilist          0            1            0
retriev              0            1            1
space                1            0            0
technique            0            0            1
vector               1            0            0
Similarity Measures
• Inner product:
   sim(dj, qk) = Σ(i=1..m) wij · wik
• Cosine similarity:
   sim(dj, qk) = Σ(i=1..m) wij · wik / ( √(Σ(i=1..m) wij²) · √(Σ(i=1..m) wik²) )
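A minimal sketch of both measures over plain weight vectors (my own code, not from the slides):

```python
import math

def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine_similarity(d, q):
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return inner_product(d, q) / (norm_d * norm_q)

doc   = [2, 2, 1]   # document weight vector
query = [1, 0, 1]   # query weight vector
print(inner_product(doc, query))                 # 3
print(round(cosine_similarity(doc, query), 3))   # 3 / (3 * 1.414...) ≈ 0.707
```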
Evaluation of IR Systems
• The evaluation of an IR system is the process of assessing how well the system meets the information needs of its users (Voorhees, 2001).
• Criteria for evaluation:
- Coverage of the collection
- Time lag
- Presentation format
- User effort
- Precision
- Recall
Evaluation of IR Systems
• IR evaluation models can be broadly classified as system-driven models and user-centered models.
• System-driven models focus on measuring how well the system can rank documents.
• User-centered evaluation models attempt to measure the user's satisfaction with the system.
Why System Evaluation?
• There are many retrieval models/algorithms/systems; which one is the best?
• What is the best component for:
   • Ranking function (dot-product, cosine, ...)
   • Term selection (stop word removal, stemming, ...)
   • Term weighting (TF, TF-IDF, ...)
• How far down the ranked list will a user need to look to find some/all relevant documents?
Evaluation of IR Systems
• The traditional goal of IR is to retrieve all and only the relevant documents in response to a query.
• All is measured by recall: the proportion of relevant documents in the collection which are retrieved.
• Only is measured by precision: the proportion of retrieved documents which are relevant.
Precision vs. Recall
[Figure: Venn diagram of All docs, Retrieved, Relevant, and RelRetrieved (their intersection)]

   Recall    = |RelRetrieved| / |Relevant|
   Precision = |RelRetrieved| / |Retrieved|
Trade-off between Recall and Precision
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1). The ideal is the top-right corner. A system returning only relevant documents but missing many useful ones has high precision and low recall; a system returning most relevant documents but lots of junk has high recall and low precision.]
Test collection approach
• The total number of relevant documents in a collection must be known in order for recall to be calculated.
• To provide a framework for the evaluation of IR systems, a number of test collections have been developed (Cranfield, TREC, etc.).
• These document collections are accompanied by a set of queries and relevance judgments.
IR test collections
Collection   Number of documents   Number of queries
Cranfield              1400              225
CACM                   3204               64
CISI                   1460              112
LISA                   6004               35
TIME                    423               83
ADI                      82               35
MEDLINE                1033               30
TREC-1              742,611              100
Fixed Recall Levels
• One way to evaluate is to look at average precision at fixed recall levels.
   • Provides the information needed for precision/recall graphs.
Document Cutoff Levels
• Another way to evaluate:
   • Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100
   • Measure precision at each of these levels
   • Take a (weighted) average over the results
• Focuses on how well the system ranks the first k documents.
Computing Recall/Precision Points
• For a given query, produce the ranked list of retrievals.
• Mark each document in the ranked list that is relevant according to the gold standard.
• Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
Computing Recall/Precision Points: An Example

n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

Let the total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
One relevant document is missing, so 100% recall is never reached.
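The recall/precision points above can be reproduced with a short sketch (my own code) that walks down the ranked list of relevance flags:

```python
def recall_precision_points(ranked_relevance, total_relevant):
    # ranked_relevance: one boolean per ranked document, top first
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# relevance flags for the 14 ranked documents in the example
ranked = [True, True, False, True, False, True, False,
          False, False, False, False, False, True, False]
for recall, precision in recall_precision_points(ranked, total_relevant=6):
    print(f"R={recall:.3f}  P={precision:.3f}")
```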
Interpolating a Recall/Precision Curve
• Interpolate a precision value for each standard recall level:
   • rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
   • r0 = 0.0, r1 = 0.1, ..., r10 = 1.0
• The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level greater than or equal to the j-th level.
Example: Interpolated Precision
Precision at observed recall points:
   Recall   Precision
   0.25     1.0
   0.4      0.67
   0.55     0.8
   0.8      0.6
   1.0      0.5
The interpolated precision:
   0.0   1.0
   0.1   1.0
   0.2   1.0
   0.3   0.8
   0.4   0.8
   0.5   0.8
   0.6   0.6
   0.7   0.6
   0.8   0.6
   0.9   0.5
   1.0   0.5
Interpolated average precision = 0.745
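Interpolation at the 11 standard recall levels can be sketched as follows (my own code, with the observed points taken from the example above):

```python
def interpolate(observed):
    # observed: (recall, precision) pairs for one query
    levels = [round(0.1 * j, 1) for j in range(11)]
    table = []
    for r in levels:
        candidates = [p for rec, p in observed if rec >= r]
        table.append((r, max(candidates) if candidates else 0.0))
    return table

observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]
table = interpolate(observed)
average = sum(p for _, p in table) / len(table)
print(table)
print(round(average, 3))   # ≈ 0.745
```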
Recall-Precision graph
[Figure: recall-precision graph]
Average Recall/Precision Curve
• Compute average precision at each standard recall level across all queries.
• Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
Average Recall/Precision Curve
[Figure: average recall/precision curves for the following models]
Model
1. doc = "atn", query = "ntc"
2. doc = "atn", query = "atc"
3. doc = "atc", query = "atc"
4. doc = "atc", query = "ntc"
5. doc = "ntc", query = "ntc"
6. doc = "ltc", query = "ltc"
7. doc = "nnn", query = "nnn"
Problems with Precision/Recall
• Can't know the true recall value
   • except in small collections
• Precision/Recall are related
   • A combined measure is sometimes more appropriate
• Assumes batch mode
   • Interactive IR is important and has different criteria for successful searches
• Assumes a strict rank ordering matters.
Other measures: R-Precision
• R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents for a topic.
   • It de-emphasizes the exact ranking of the retrieved relevant documents.
   • The average is simply the mean R-Precision for individual topics in the run.
Other measures: F-measure
• F-measure takes into account both precision and recall. It is defined as the harmonic mean of recall and precision:

   F = 2PR / (P + R)
Evaluation Problems
• Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations.
• Evaluating interactive IR requires human subjects (no gold standard or benchmarks).
[Ref.: see Borlund, 2000 & 2003; Borlund & Ingwersen, 1997 for IIR evaluation]
Web IR : II
Tutorial on Web IR (Contd.)
Web Page: Basics
• are written in a tagged markup language called the Hypertext Markup Language (HTML)
• contain hyperlinks
- A hyperlink creates a link to another web page using a uniform resource locator (URL)
- A URL (http://www.icdim.org/program.asp) contains
   a protocol field (http)
   a server hostname (www.icdim.org)
   a file path (/, the 'root' of the published file system)
• are served through the Internet using the Hypertext Transfer Protocol (HTTP) to client computers
- HTTP is built on top of TCP (Transmission Control Protocol)
• can be viewed using browsers
IR on the Web
• Input: the publicly accessible Web
• Goal: retrieve high-quality pages that are relevant to the user's need
– Static (files: text, audio, ...)
– Dynamically generated on request: mostly database access
Two aspects:
1. Processing and representing the collection
   • Gathering the static pages
   • "Learning" about the dynamic pages
2. Processing queries (searching)
How Web IR differs from classic IR
The Web is:
• huge,
• dynamic,
• self-organized, and
• hyperlinked
How Web IR differs from classic IR
1. Pages:
• Bulk: >1B (12/99)
• Lack of stability: estimates of 23%/day, 38%/week
• Heterogeneity
   – Type of documents: text, pictures, audio, scripts, ...
   – Quality
   – Language: 100+
• Duplication
   – Syntactic: 30% (near) duplicates
   – Semantic: ??
• High linkage: ≥ 8 links/page on average
The big challenge
Meet the user needs given the heterogeneity of Web pages.
How Web IR differs from classic IR
2. Users:
• Make poor queries
– Short (2.35 terms avg)
– Imprecise terms
– Sub-optimal syntax (80% of queries without operators)
– Low effort
How Web IR differs from classic IR
• Wide variance in
– Needs
– Knowledge
– Bandwidth
• Specific behavior
– 85% look over one result screen only
– 78% of queries are not modified
– Follow links
The big challenge
Meet the user needs given the heterogeneity of Web pages and the poorly made queries.
Web IR : The bright side
• Many tools available
• Personalization
• Interactivity (refine the query if needed)
Web IR tools
• General-purpose search engines:
– Direct: AltaVista, Excite, Google, Infoseek, Lycos, ...
– Indirect (meta-search): MetaCrawler, AskJeeves, ...
• Hierarchical directories:
Yahoo!, The Best of the Web (http://botw.org)
Best of the Web
Web IR tools
Specialized search engines:
– Home page finder: Ahoy
– Shopping robots: Jango, Junglee, ...
– School and university: searchedu.com
– Live images from everywhere: EarthCam (http://www.earthcam.com)
– The Search Engines Directory: http://www.searchengineguide.com/searchengines.html
Search-by-example: Alexa's "What's related", Excite's "More like this"
Search Engines' Components
1. Crawler (spider, robot, or bot) – fetches web pages
2. Indexer – processes and represents the data
3. Search interface – answers queries
Crawler: Basic Principles
A Crawler collects web pages by scanning
collected web pages for hyperlinks to other
pages that have not been collected yet.
1. Start from a given set of URLs
2. Fetch and scan them for new URLs (outlinks)
3. Fetch these pages in turn
4. Go to step 1 (repeat steps 1-3 for the new pages)
[Ref.: Heydon & Najork, 1999; Brin & Page, 1998]
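A toy sketch of that loop, using only the Python standard library (my own code; a real crawler such as Mercator adds politeness, robots.txt handling and duplicate detection, and the seed URL below is a placeholder):

```python
import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=20):
    frontier = deque(seed_urls)        # URLs still to fetch
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                   # skip pages that cannot be fetched
        pages[url] = html
        # naive outlink extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# pages = crawl(["http://example.com/"])   # placeholder seed URL
```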
Indexing the web
In traditional IR, documents are static and non-linked.
Web documents are dynamic: documents are added, modified or deleted.
The mappings from terms to documents and positions are not constructed incrementally (see the inverted index example earlier).
How to update the index?
Web documents are linked: they contain hyperlinks.
How to rank documents?
Web pages are dynamic: some figures
40% of all webpages in their dataset changed within a week, and 23% of the .com pages changed daily
[Junghoo Cho and Hector Garcia-Molina, 2000]
Updating index : A Solution
- A static index is made, which is the main index used for answering queries.
- A signed (d,t) record, written (d,t,s), is maintained for documents being added or deleted (or modified), where s is a bit specifying whether the document has been deleted or inserted.
- A stop-press index is created using the (d,t,s) records.
- A query is sent to both the main index and the stop-press index.
- The main index returns a set of documents (D0).
- The stop-press index returns two sets (D+ and D-):
  D+ is the set of matching documents not yet indexed, and D- is the set of documents matching the query that have been removed from the collection since D0 was constructed.
Updating index
- The retrieved set is constructed as (D0 ∪ D+) \ D-.
- When the stop-press index gets too large, the signed (d,t,s) records are sorted in (d,t,s) order and merge-purged into the master (t,d) records.
- The master index is rebuilt.
- The stop-press index is emptied.
Index Compression Techniques
• A significant portion of index space is used by document IDs.
• Delta encoding is used to save space:
- Sort document IDs in increasing order
- Store the first ID in full, and only the gaps (i.e. the difference from the previous ID) for subsequent entries
Delta Encoding: Example
Suppose the word harmony appears in documents 50, 75, 110, 120, 125, 170, 200. The record for harmony is the vector (50, 25, 35, 10, 5, 45, 30).
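A minimal encode/decode pair (my own sketch) that reproduces this record:

```python
def delta_encode(doc_ids):
    # doc_ids must already be sorted in increasing order
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

postings = [50, 75, 110, 120, 125, 170, 200]
print(delta_encode(postings))                              # [50, 25, 35, 10, 5, 45, 30]
print(delta_decode(delta_encode(postings)) == postings)    # True
```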
Other Issues
• Spamming
Adding popular terms to pages unrelated to those terms.
• Titles, headings, metatags
Search engines give additional weight to terms occurring in titles, headings, font modifiers and metatags.
• Approximate string matching
- Soundex
- n-gram (Google suggests variant spellings of query terms based on the query log using n-grams)
• Metasearch engines
Meta-search engines send the query to several search engines at once and return the results from all of the search engines in one long unified list.
Search Engine Watch
http://searchenginewatch.com/showPage.html?page=2156451

Shares of Searches: July 2006
Link Analysis
• Link analysis is a technique that exploits the additional information inherent in the hyperlink structure of the Web to improve the quality of search results.
• Link analysis or ranking algorithms underlie several of today's most popular and successful search engines.
• All major search engines combine link analysis scores with more traditional information retrieval scores to retrieve web pages in response to a query.
• PageRank and HITS are two algorithms for ranking web pages based on links.
Google's Search behaviour: A few points
• Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, term order and the proximity of the search terms to one another on the page.
• Words in a special typeface, bold, underlined or all capitals, get extra credit.
• Google can also match multi-word phrases and sentences.
• Google does not disclose everything that goes into its search index, but the cornerstone algorithm, called PageRank, is well known.
• For more information on Google's technology, visit www.google.com/technology
Google’s Search behaviour
• Google returns only pages that match all your search terms.
A search for [ cricket match ] finds pages containing the words "cricket" and "match".
Google’s Search behaviour
• Google returns pages that match your search terms exactly.
If you search for "tv" Google won't find "television"; if you search for "cheap" Google won't find "inexpensive".
Google’s Search behaviour
• Google returns pages that match variants of your search terms.
The query [ child lock ] finds pages that contain words that are similar to some or all of your search terms, e.g., "child", "children", "children's", "locks" or "locking".
Google’s Search behaviour
• Google ignores stop words.
• Google favors results that have your search terms near each other.
(Google considers the proximity of your search terms within a page.)
Google’s Search behaviour
• Google gives higher priority to pages that have the terms in the same order as in your query.
• Google is NOT case sensitive; it assumes all search terms are lowercase.
• Google ignores some punctuation and special characters, including , . ; ? [ ] ( ) @ / # .
[ Dr. Tanveer ] returns the same results as [ Dr Tanveer ]
Google’s Search behaviour
• A term with an apostrophe (single quote) doesn't match the term without an apostrophe.
• Google searches for variations on any hyphenated terms.
[ e-mail ] matches "e-mail", "email", and "e mail"
Google’s Search behaviour: Summary
Google’s PageRank
• PageRank is a system for ranking web pages, developed by the Google founders Larry Page and Sergey Brin at Stanford University.
• PageRank determines the quality of a Web page by the pages that link to it.
• Once a month, Google's spiders crawl the Web to update and enhance the Google index.
• PageRank is a numeric value that represents how important a page is on the web.
Google’s PageRank
• PageRank looks at a Web page and determines how many other pages link to it (a measure of popularity). PageRank then analyzes the links to those pages.
• When one page links to another page, it is effectively casting a vote for the other page.
• The more votes that are cast for a page, the more important the page must be.
• Google calculates a page's importance from the votes cast for it.
Google’s PageRank
The PageRank of page A, PR(A), is:
   PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where
   T1...Tn are pages linking to page A,
   d is a damping factor, usually set to 0.85,
   PR(T) is the PageRank of page T,
   C(T) is the number of links going out of page T,
   PR(T)/C(T) is the PageRank of page T divided by the number of links going out of that page.
Google’s PageRank
More simply,
   a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
• "share" = the linking page's PageRank divided by the number of outgoing links on that page
• A page "votes" an amount of PageRank onto each page that it links to
Google’s PageRank
The PageRank of a page that links to yours is important, but the number of links on that page is also important.
The more links there are on a page, the less PageRank value your page will receive from it.
Google’s PageRank
• The PageRank algorithm is applied by first guessing a PageRank for all the pages that have been indexed and then recursively iterating until the PageRank converges.
• What PageRank in effect says is that pages "vote" for other pages on the Internet. So if page A links to page B (i.e. page A votes for page B), it is saying that B is an important page.
- If lots of pages link to a page, then it has more votes and its worth should be higher.
Google's PageRank: Example
Let's consider a web site with 3 pages (A, B and C) with no links coming in from the outside and an initial PageRank of 1.
[Diagram: pages A, B and C with no links]
After one iteration:
   PR(A) = (1 - d) + d(...) = 0.15
   PR(B) = PR(C) = 0.15
After 10 iterations:
   PR(A) = PR(B) = PR(C) = 0.15
Example
Now, we link page A to page B and run the calculations for each page.
[Diagram: A → B, C isolated]
We end up with:
   Page A = 0.15
   Page B = 1
   Page C = 0.15
After the second iteration the figures are:
   Page A = 0.15
   Page B = 0.2775
   Page C = 0.15
Example
Now, we link all pages to all pages and repeat the calculations with an initial PageRank of 1.
[Diagram: A, B and C all linked to each other]
We get:
   Page A = 1
   Page B = 1
   Page C = 1
Example
Now remove the links between page B and page C.
[Diagram: A ↔ B and A ↔ C]
After 1 iteration the results are:
   PR(A) = 0.15 + 0.85(1 + 1) = 1.85
   PR(B) = 0.15 + 0.85(1/2) = 0.575
   PR(C) = 0.575
After the second iteration, the results are:
   PR(A) = 1.1275
   PR(B) = 0.93625
   PR(C) = 0.93625
After the third iteration:
   PR(A) = 1.741625
   PR(B) = 0.629
   PR(C) = 0.629
...
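The iterations above can be checked with a short sketch of the simplified PageRank formula used on these slides (my own code; it assumes every page has at least one outgoing link), and it can also be used for the exercise that follows:

```python
def pagerank(links, d=0.85, iterations=10):
    # links: {page: [pages it links to]}; every page must have at least one outlink
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # initial PageRank of 1, as in the examples
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = [q for q in pages if p in links[q]]
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
        pr = new_pr                       # update all pages from the previous iteration
    return pr

# pages A, B, C with links A<->B and A<->C (the last example above)
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
print(pagerank(links, iterations=1))   # {'A': 1.85, 'B': 0.575, 'C': 0.575}
```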
Exercise: calculate PageRank values for the following:
[Diagram: a link graph over pages A, B and C]
Semantic Web?
• Semantic stands for "the meaning of". The semantics of something is the meaning of that thing.
• The Semantic Web is about describing things in a way that computer applications can understand.
Semantic Web
• The Semantic Web is not about links between web pages.
• It describes the relationships between things (like A is a part of B, and Y is a member of Z) and the properties of things (like size, weight, age, etc.).
Semantic Web
Semantic Web = a Web with a meaning
"If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database"
Tim Berners-Lee, Weaving the Web, 1999
The Semantic Web uses RDF to describe web resources.
RDF
The Resource Description Framework
• RDF (Resource Description Framework) is a markup language for describing information and resources on the web.
• Putting information into RDF files makes it possible for computer programs ("web spiders") to search, discover, pick up, collect, analyze and process information from the web.
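As a small illustration (assuming the third-party rdflib package; the namespace and property names are invented for the example, not taken from the slides):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")            # hypothetical namespace
g = Graph()
g.add((EX.wheel, EX.isPartOf, EX.car))           # "A is a part of B"
g.add((EX.alice, EX.isMemberOf, EX.choir))       # "Y is a member of Z"
g.add((EX.car, EX.weight, Literal("1500 kg")))   # a property of a thing
print(g.serialize(format="turtle"))              # returns a string in recent rdflib versions
```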
References
Soumen Chakrabarti, "Mining the Web: Discovering knowledge from hypertext data". Elsevier, 2003.
B. J. Jansen, A. Spink and T. Saracevic, "Real life, real users and real needs: A study and analysis of user queries on the Web". Information Processing & Management, 36(2), 207-227, 2000.
G. Salton, E. A. Fox and H. Wu, "Extended Boolean information retrieval". Communications of the ACM, 26(11), 1022-1036, 1983.
G. Salton and M. J. McGill, "Introduction to modern information retrieval". New York: McGraw-Hill, 1983.
C. J. van Rijsbergen, "Information Retrieval". 2nd ed. Butterworths, London.
E. A. Fox, S. Betrabet, M. Kaushik and W. Lee, "Extended Boolean model". In W. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms. Prentice Hall, pp. 393-418, 1992.
H. P. Luhn, "The automatic creation of literature abstracts". IBM Journal of Research and Development, 2(2), 1958.
C. P. Paice, "Soft evaluation of Boolean search queries in information retrieval systems". Information Technology: Research and Development, 3(1), 33-42, 1984.
M. F. Porter, "An algorithm for suffix stripping". Program, 14(3), 130-137, 1980.
A. Heydon and M. Najork, "Mercator: A scalable, extensible web crawler". World Wide Web, 2(4), pages 219-229, 1999.
S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine". In Proceedings of the 7th World Wide Web Conference, 1998. decweb.ethz.ch/WWW7/1921/com1921.htm.
References
P. Borlund and P. Ingwersen, "The development of a method for the evaluation of interactive information retrieval systems". Journal of Documentation, 53:225-250, 1997.
P. Borlund, "Experimental components for the evaluation of interactive information retrieval systems". Journal of Documentation, Vol. 56, No. 1, 2000.
P. Borlund, "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems". Information Research, 8(3), 2003.
Anastasios Tombros and C. J. van Rijsbergen, "Query-sensitive similarity measures for information retrieval". Knowledge and Information Systems, 6, 617-642, 2004.
Tanveer J. Siddiqui and Uma Shanker Tiwary, "Integrating relation and keyword matching in information retrieval". In Rajiv Khosla, Robert J. Howlett, Lakhmi C. Jain (Eds.), Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Data Mining and Soft Computing Applications-II, Lecture Notes in Computer Science, vol. 3684, pages 64-71, Melbourne, Australia, September 2005. Springer Verlag.
Siddiqui, Tanveer J. and Tiwary, U. S., "A hybrid model to improve relevance in document retrieval". Journal of Digital Information Management, 4(1), 2006, p. 73-81.
Resources
Text books: Salton; van Rijsbergen; Sandor Dominich; Frakes & Baeza-Yates
Suggested readings:
Robertson, Sparck Jones, Voorhees
Myaeng, Liddy, Khoo, I. Ounis, Gelbukh, A. F. Smeaton, T. Strzalkowski
B. J. Jansen, A. Spink
Padmini Srinivasan, M. Mitra, A. Singhal
T. Saracevic, D. Harman, P. Borlund (evaluation/relevance)
Resources
Information Retrieval
Information Processing & Management
JASIS (Journal of the American Society for Information Science)
TOIS (ACM Transactions on Information Systems)
Information Research (online)
Proceedings of SIGIR/TREC conferences
Journal of Digital Information Management
KAIS (Springer)
Thank You
More Related Content

PPTX
01 IRS-1 (1) document upload the link to
PPTX
01 IRS to upload the data according to the.pptx
PDF
Chapter 1 Introduction to ISR (1).pdf
PPT
3392413.ppt information retreival systems
PDF
Information Retrieval and Map-Reduce Implementations
PPTX
Introduction to Information Retrieval (concepts and principles)
PDF
Information_Retrieval_Models_Nfaoui_El_Habib
01 IRS-1 (1) document upload the link to
01 IRS to upload the data according to the.pptx
Chapter 1 Introduction to ISR (1).pdf
3392413.ppt information retreival systems
Information Retrieval and Map-Reduce Implementations
Introduction to Information Retrieval (concepts and principles)
Information_Retrieval_Models_Nfaoui_El_Habib

Similar to ICDIM 06 Web IR Tutorial [Compatibility Mode].pdf (20)

PPTX
Information retrieval introduction
PDF
Chapter 1: Introduction to Information Storage and Retrieval
PPT
Intro.ppt
PPT
Information Retrieval and Storage Systems
PPTX
Chapter 1.pptx
PPTX
Information retrival system and PageRank algorithm
PDF
Information retrieval concept, practice and challenge
PDF
Information retrieval systems irt ppt do
PPT
Information retrival system it is part and parcel
PPT
information retirval system,search info insights in unsturtcured data
PPTX
Chapter 1 Intro Information Rerieval.pptx
PPTX
Tdm information retrieval
PDF
Chapter 1 Introduction to Information Storage and Retrieval.pdf
PPTX
information retrival in natural language processing.pptx
PPT
Information Retrieval
PPTX
Week14-Multimedia Information Retrieval.pptx
PDF
14. Michael Oakes (UoW) Natural Language Processing for Translation
PPTX
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
DOCX
unit 1 INTRODUCTION
PDF
A Simple Information Retrieval Technique
Information retrieval introduction
Chapter 1: Introduction to Information Storage and Retrieval
Intro.ppt
Information Retrieval and Storage Systems
Chapter 1.pptx
Information retrival system and PageRank algorithm
Information retrieval concept, practice and challenge
Information retrieval systems irt ppt do
Information retrival system it is part and parcel
information retirval system,search info insights in unsturtcured data
Chapter 1 Intro Information Rerieval.pptx
Tdm information retrieval
Chapter 1 Introduction to Information Storage and Retrieval.pdf
information retrival in natural language processing.pptx
Information Retrieval
Week14-Multimedia Information Retrieval.pptx
14. Michael Oakes (UoW) Natural Language Processing for Translation
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
unit 1 INTRODUCTION
A Simple Information Retrieval Technique
Ad

Recently uploaded (20)

PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
Well-logging-methods_new................
PPT
Project quality management in manufacturing
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
CH1 Production IntroductoryConcepts.pptx
PPTX
Welding lecture in detail for understanding
PPTX
Sustainable Sites - Green Building Construction
PPT
Mechanical Engineering MATERIALS Selection
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
bas. eng. economics group 4 presentation 1.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Lecture Notes Electrical Wiring System Components
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Well-logging-methods_new................
Project quality management in manufacturing
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
CH1 Production IntroductoryConcepts.pptx
Welding lecture in detail for understanding
Sustainable Sites - Green Building Construction
Mechanical Engineering MATERIALS Selection
Ad

ICDIM 06 Web IR Tutorial [Compatibility Mode].pdf

  • 1. Web Information Retrieval 1 Tanveer J Siddiqui J K Institute of Applied Physics & Technology University of Allahabad
  • 2. Example of a query Engine 1 2
  • 3. Topics –Algorithmic issues in classic information retrieval (IR), e.g. indexing, – issues related to the Web IR 3 (IR), e.g. indexing, stemming, etc.
  • 4. Information Retrieval  Information retrieval (IR) deals with the organization, storage, retrieval and evaluation of information relevant to user’s query. 4 user’s query.  A user having an information need formulates a request in the form of query written in natural language. The retrieval system responds by retrieving document that seems relevant to the query
  • 5. Information Retrieval  Traditionally it has been accepted that information retrieval system does not return the actual information but the documents containing that information. As Lancaster pointed out: 5 pointed out: ‘An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.’
  • 6.  A question answering system provides user with the answers to specific questions.  Data retrieval systems retrieve precise data 6  Data retrieval systems retrieve precise data  An information retrieval system does not search for specific data as in data retrieval system, nor search for direct answers to a question as in question answering system.
  • 8. Classic IR  Input: Document collection  Goal: Retrieve documents or text with information content that is relevant to user’s information need 8 user’s information need  Two aspects: 1. Processing the collection 2. Processing queries (searching)
  • 9. Information Retrieval Model An IR model is a pattern that defines several aspects of retrieval procedure, for example, 9  how the documents and user’s queries are represented  how system retrieves relevant documents according to users’ queries &  how retrieved documents are ranked.
  • 10. IR Model  An IR model consists of - a model for documents - a model for queries and 10 - a model for queries and - a matching function which compares queries to documents.
  • 11. Classical IR Model IR models can be classified as:  Classical models of IR  Non-Classical models of IR 11  Non-Classical models of IR  Alternative models of IR
  • 12. Classical IR Model  based on mathematical knowledge that was easily recognized and well understood  simple, efficient and easy to implement  The three classical information retrieval 12  The three classical information retrieval models are: -Boolean -Vector and -Probabilistic models
  • 13. Non-Classical models of IR Non-classical information retrieval models are based on principles other than similarity, probability, Boolean operations etc. on which classical retrieval models are 13 etc. on which classical retrieval models are based on. information logic model, situation theory model and interaction model.
  • 14. Alternative IR models  Alternative models are enhancements of classical models making use of specific techniques from other fields. 14 Example: Cluster model, fuzzy model and latent semantic indexing (LSI) models.
  • 15. Information Retrieval Model  The actual text of the document and query is not used in the retrieval process. Instead, some representation of it.  Document representation is matched with 15  Document representation is matched with query representation to perform retrieval  One frequently used method is to represent document as a set of index terms or keywords
  • 16. Indexing  The process of transforming document text to some representation of it is known as indexing. 16  Different index structures might be used. One commonly used data structure by IR system is inverted index.
  • 17. inverted index  An inverted index is simply a list of keywords (tokens/terms), with each keyword having pointers to the documents containing that keyword 17 documents containing that keyword -A document is tokenized. A nonempty sequence of characters not including spaces and punctuations can be regarded as tokens
  • 18. inverted index -Each distinct token in the collection may be represented as a distinct integer -A document is thus transformed into a 18 -A document is thus transformed into a sequence of integers
  • 19. Example : Inverted index D1  The weather is cool. D2  She has a cool mind. The inverted index can be represented as a table tid did pos 1 (The) 1 1 2 (Weather) 1 2 3 (is) 1 3 4 (cool) 1 4 19 represented as a table called POSTING. 4 (cool) 1 4 5 (She) 2 1 6 (has) 2 2 7 (a) 2 3 4 (cool) 2 4 8 (mind) 2 5
  • 20. The  d1 weather  d1 Example : Inverted index The  d1/1 weather  d1/2 is  d1/3 20 weather  d1 is  d1 cool  d1,d2 She  d2 has  d2 a  d2 mind  d2 is  d1/3 cool  d1/4,d2/4 She  d2/1 has  d2/2 a  d2/3 mind  d2/5
  • 21.  The computational cost involved in adopting a full text logical view (i.e. using full set of words to represent a document) is high. 21 is high. Hence, some text operations are usually performed to reduce the set of representative keywords.
  • 22.  Two most commonly used text operations are: 1. Stop word elimination and 22 1. Stop word elimination and 2. Stemming  Zipf’s law
  • 23.  Stop word elimination involves removal of grammatical or function words while stemming reduces distinct words to their common grammatical root 23 their common grammatical root
  • 24. Indexing  Most of the indexing techniques involve identifying good document descriptors, such as keywords or terms, to describe information content of the documents. 24 information content of the documents.  A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection.
  • 25. Term can be a single word or it can be multi- word phrases. Example: Design Features of Information 25 Design Features of Information Retrieval systems can be represented by the set of terms : Design, Features, Information, Retrieval, systems or by the set of terms: Design, Features, Information Retrieval, Information Retrieval systems
  • 26. Luhn’s early Assumption  Luhn assumed that frequency of word occurrence in an article gives meaningful identification of their content. 26  discrimination power for index terms is a function of the rank order of their frequency of occurrence
  • 27. Stop Word Elimination  Stop words are high frequency words, which have little semantic weight and are thus unlikely to help with retrieval. 27  Such words are commonly used in documents, regardless of topics; and have no topical specificity.
  • 28. Example : articles (“a”, “an” “the”) and prepositions (e.g. “in”, “of”, “for”, “at” etc.). 28 (e.g. “in”, “of”, “for”, “at” etc.).
  • 29. Stop Word Elimination  Advantage Eliminating these words can result in considerable reduction in number of index terms without losing any significant information. 29 terms without losing any significant information.  Disadvantage It can sometimes result in elimination of terms useful for searching, for instance the stop word A in Vitamin A. Some phrases like “to be or not to be” consist entirely of stop words.
  • 30. Stop Words  About, above, accordingly, afterwards, again, against, alone, along, already, am, among, amongst, and, another, any, anyone, anything, anywhere, 30 any, anyone, anything, anywhere, around, as, aside, awfully, be, because
  • 31. Stemming  Stemming normalizes morphological variants  It removes suffixes from the words to 31  It removes suffixes from the words to reduce them to some root form e.g. the words compute, computing, computes and computer will all be reduced to same word stem comput.
  • 32. Stemming  Porter Stemmer(1980).  Example: The stemmed representation of 32 The stemmed representation of Design Features of Information Retrieval systems will be {design, featur, inform, retriev, system}
  • 33. Stemming stemming throws away useful distinction. In some cases it may be useful to help conflate similar terms resulting in increased recall in others it may be harmful 33 increased recall in others it may be harmful resulting in reduced precision
  • 34. Zipf’s law Zipf law frequency of words multiplied by their ranks in a large corpus 34 by their ranks in a large corpus is approximately constant, i.e. constant rank frequency  
  • 35. 35 Relationship between frequency of words and its rank order
  • 36. Luhn’s assumptions  Luhn(1958) attempted to quantify the discriminating power of the terms by associating their frequency of occurrence (term frequency) within the document. He postulated that: 36 -The high frequency words are quite common (function words) - low frequency words are rare words - medium frequency words are useful for indexing
  • 37. Boolean model  the oldest of the three classical models.  is based on Boolean logic and classical set theory. 37 set theory.  represents documents as a set of keywords, usually stored in an inverted file.
  • 38. Boolean model  Users are required to express their queries as a boolean expression consisting of keywords connected with boolean logical operators (AND, OR, 38 boolean logical operators (AND, OR, NOT).  Retrieval is performed based on whether or not document contains the query terms.
  • 39. Boolean model Given a finite set T = {t1, t2, ...,ti,...,tm} of index terms, a finite set 39 of index terms, a finite set D = {d1, d2, ...,dj,...,dn} of documents and a boolean expression in a normal form - representing a query Q as follows: } , { θi ), θ ( i i i t t Q     
  • 40. Boolean model 1. The set Ri of documents are obtained that contain or not term ti: Ri = { }, j j d i d   | , } , { i i t t i    40 Ri = { }, where 2. Set operations are used to retrieve documents in response to Q: j j d i d   | } , { t t i    j i j i d t d t    means i R 
  • 41. Example: Boolean model  Let the set of original documents be D = {D1, D2, D3}, where D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query. D2 = A user having an information need formulates a request in the form of query written in natural language. D3 = The retrieval system responds by retrieving document that seems relevant to the query.
  • 42. Example: Boolean model  Let the set of terms used to represent these documents be: T = {information, retrieval, query}. Then the set D of documents will be represented as follows: D = {d1, d2, d3}, where d1 = {information, retrieval, query}, d2 = {information, query}, d3 = {retrieval, query}.
  • 43. Example (contd.)  Let the query Q be: Q = information ∧ retrieval. First, the following sets S1 and S2 of documents are retrieved in response to Q: S1 = {dj | information ∈ dj} = {d1, d2} and S2 = {dj | retrieval ∈ dj} = {d1, d3}. Then the following documents are retrieved in response to Q: {dj | dj ∈ S1 ∩ S2} = {d1}.
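A minimal Python sketch of Boolean retrieval over the three example documents, using the set-of-terms representation and plain set operations; the variable names are illustrative.

```python
# Boolean retrieval over the running example: documents as sets of index terms.
docs = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"information", "query"},
    "d3": {"retrieval", "query"},
}

def contains(term):
    """Set of documents containing the given index term."""
    return {d for d, terms in docs.items() if term in terms}

# Query: information AND retrieval
result = contains("information") & contains("retrieval")
print(sorted(result))   # ['d1']
```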
  • 44. Boolean model  the model is not able to retrieve documents that are only partly relevant to the user's query.  it is unable to produce a ranked list of documents.  users seldom formulate their queries as the pure Boolean expressions this model requires.
  • 45. Extended Boolean model  To overcome the weaknesses of the Boolean model, numerous extensions have been suggested that do provide a ranked list of documents.  Discussion of these extensions is beyond the scope of this tutorial; see the p-norm model (Salton et al., 1983) and the Paice model (Paice, 1984).
  • 46. Vector Space Model  It represents documents and queries as vectors of features representing terms.  Features are assigned numerical values that are usually some function of term frequency.  A ranking algorithm computes the similarity between document and query vectors to assign a retrieval score to each document.
  • 47. Vector Space Model  Given a finite set of n documents D = {d1, d2, ..., dj, ..., dn} and a finite set of m terms T = {t1, t2, ..., ti, ..., tm}, each document is represented by a column vector of weights (w1j, w2j, w3j, ..., wij, ..., wmj)t, where wij is the weight of term ti in document dj.
  • 48. Vector Space Model  The document collection as a whole is represented by an m × n term–document matrix:
  w11 w12 ... w1j ... w1n
  w21 w22 ... w2j ... w2n
  ...
  wi1 wi2 ... wij ... win
  ...
  wm1 wm2 ... wmj ... wmn
  • 49. Example: Vector Space Model  D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query. D2 = A user having an information need formulates a request in the form of query written in natural language. D3 = The retrieval system responds by retrieving document that seems relevant to the query.
  • 50. Example: Vector Space Model  Let the weights be assigned based on the frequency of the term within the document. The term–document matrix (rows: information, retrieval, query; columns: D1, D2, D3) is:
  2 1 0
  2 0 1
  1 1 1
  • 51. Vector Space Model  The raw term frequency approach gives too much importance to the absolute values of the various coordinates of each document.
  • 52. Consider two document vectors (2, 2, 1)t and (4, 4, 2)t. The documents look similar except for the difference in magnitude of the term weights.
  • 53. Normalizing term weights  To reduce the importance of the length of document vectors we normalize document vectors.  Normalization changes all the vectors to a standard length. We can convert document vectors to unit length by dividing each dimension by the overall length of the vector.
  • 54. Normalizing the term–document matrix: dividing the elements of each column by the length of the column vector, sqrt(Σi wij²), we get:
  0.67 0.71 0
  0.67 0    0.71
  0.33 0.71 0.71
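A short NumPy sketch (assuming numpy is available) that reproduces this normalization by dividing each column of the example term–document matrix by its Euclidean length.

```python
# Length-normalize the example term-document matrix column by column.
import numpy as np

# rows: information, retrieval, query; columns: D1, D2, D3
M = np.array([[2, 1, 0],
              [2, 0, 1],
              [1, 1, 1]], dtype=float)

norms = np.linalg.norm(M, axis=0)        # length of each column vector
print(np.round(M / norms, 2))
# [[0.67 0.71 0.  ]
#  [0.67 0.   0.71]
#  [0.33 0.71 0.71]]
```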
  • 55. Term weighting  Luhn's postulate can be refined by noting that: 1. The more a document contains a given word, the more that document is about the concept represented by that word. 2. The fewer documents in the collection a term occurs in, the more discriminating that term is.
  • 56. Term weighting  The first factor simply means that terms occurring more frequently in a document represent its meaning more strongly than those occurring less frequently.  The second factor considers term distribution across the document collection.
  • 57. Term weighting  A measure that favors terms appearing in fewer documents is required. The fraction n/ni gives exactly this measure, where n is the total number of documents in the collection and ni is the number of documents in which term i occurs.
  • 58. Term weighting  As the number of documents in any collection is usually large, the log of this measure is taken, resulting in the following form of the inverse document frequency (idf) term weight: idfi = log(n / ni).
  • 59. Tf-idf weighting scheme  tf - document-specific statistic.  idf - global statistic that attempts to capture the distribution of a term across the document collection.
  • 60. Tf-idf weighting scheme  The term frequency (tf) component is a document-specific statistic that measures the importance of a term within the document.  The inverse document frequency (idf) is a global statistic that attempts to capture the distribution of a term across the document collection.
  • 61. Tf-idf weighting scheme  Example: computing tf-idf weights.
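Since the slide's worked example is not reproduced here, the following is a minimal sketch of tf-idf weighting, wij = tfij × log(n/ni), applied to document D1 of the running example. Note how "query", which occurs in every document, receives zero weight, illustrating the discriminating role of idf.

```python
# tf-idf weights for D1 of the running example: w = tf * log(n / df).
import math

tf = {"information": 2, "retrieval": 2, "query": 1}   # term counts in D1
df = {"information": 2, "retrieval": 2, "query": 3}   # document frequencies
n = 3                                                  # collection size

tfidf = {t: tf[t] * math.log(n / df[t]) for t in tf}
print({t: round(w, 3) for t, w in tfidf.items()})
# {'information': 0.811, 'retrieval': 0.811, 'query': 0.0}
```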
  • 62. Normalizing tf and idf factors  tf can be normalized by dividing the term frequency by the frequency of the most frequent term in the document.  idf can be normalized by dividing it by the logarithm of the collection size (n).
  • 63. Term weighting schemes  A third factor that may affect the weighting function is document length.  Weighting schemes can thus be characterized by the following three factors: 1. within-document frequency or term frequency, 2. collection frequency or inverse document frequency, 3. document length.
  • 64.  A term weighting scheme can be represented by a triple ABC: A - tf component, B - idf component, C - length normalization component.
  • 65.  Different combinations of options can be used to represent document and query vectors.  The retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc = “nnn”, query = “nnn”).
  • 66. Options for the three weighting factors  Term frequency within document (A): n - raw term frequency, tf = tfij; b - binary weight, tf = 0 or 1; a - augmented term frequency, tf = 0.5 + 0.5 · tfij / max tf in Dj; l - logarithmic term frequency, tf = ln(tfij) + 1.0.  Inverse document frequency (B): n - no conversion, wt = tf; t - multiply tf with idf.
  • 67. Options for the three weighting factors  Document length (C): n - no conversion, wij = wt; c - cosine normalization, wij is obtained by dividing each wt by sqrt(sum of squared wts).
  • 68. Indexing Algorithm  Step 1. Tokenization: extracts individual terms (words) from the document, converts all words to lower case and removes punctuation marks. The output of this stage is a representation of the document as a stream of terms.  Step 2. Stop word elimination: removes words that appear frequently across the document collection.  Step 3. Stemming: reduces the remaining terms to their linguistic root, to obtain index terms.  Step 4. Term weighting: assigns weights to terms according to their importance in the document, in the collection, or some combination of both.
  • 69. Example: Document Representation
  Stemmed term   Document 1   Document 2   Document 3
  inform              0            0            1
  intellig            0            0            1
  model               1            1            0
  probabilist         0            1            0
  retriev             0            1            1
  space               1            0            0
  technique           0            0            1
  vector              1            0            0
  • 70. Similarity Measures  Inner product: sim(dj, qk) = Σi=1..m wij · wik.  Cosine similarity: sim(dj, qk) = (Σi=1..m wij · wik) / (sqrt(Σi=1..m wij²) · sqrt(Σi=1..m wik²)).
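A small sketch of the two similarity measures for dense vectors; the document and query vectors below come from the running example, with the query "information retrieval".

```python
# Inner product and cosine similarity between a document and a query vector.
import math

def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine(d, q):
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return inner_product(d, q) / (norm_d * norm_q)

d1 = [2.0, 2.0, 1.0]     # D1 from the running example
q  = [1.0, 1.0, 0.0]     # query: "information retrieval"
print(inner_product(d1, q), round(cosine(d1, q), 3))   # 4.0 0.943
```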
  • 71. Evaluation of IR Systems  The evaluation of an IR system is the process of assessing how well the system meets the information needs of its users (Voorhees, 2001).  Criteria for evaluation: coverage of the collection, time lag, presentation format, user effort, precision, recall.
  • 72. Evaluation of IR Systems  IR evaluation models can be broadly classified as system-driven models and user-centered models.  System-driven models focus on measuring how well the system can rank documents, while user-centered models attempt to measure the user's satisfaction with the system.
  • 73. Why System Evaluation?  There are many retrieval models/algorithms/systems; which one is the best?  What is the best component for: ranking function (dot-product, cosine, ...), term selection (stop word removal, stemming, ...), term weighting (tf, tf-idf, ...)?  How far down the ranked list will a user need to look to find some/all relevant documents?
  • 74. Evaluation of IR Systems  The traditional goal of IR is to retrieve all and only the documents relevant to a query.  "All" is measured by recall: the proportion of relevant documents in the collection that are retrieved.  "Only" is measured by precision: the proportion of retrieved documents that are relevant.
  • 75. Precision vs. Recall  [Venn diagram: RelRetrieved is the intersection of the Retrieved and Relevant sets within all docs.]  Recall = |RelRetrieved| / |Relevant|.  Precision = |RelRetrieved| / |Retrieved|.
  • 76. Trade-off between Recall and Precision  [Precision-recall plot: the ideal sits at precision 1 and recall 1; a system that returns relevant documents but misses many useful ones has high precision and low recall; one that returns most relevant documents but includes lots of junk has high recall and low precision.]
  • 77. Test collection approach  The total number of relevant documents in a collection must be known in order for recall to be calculated.  To provide a framework for evaluation of IR systems, a number of test collections have been developed (Cranfield, TREC, etc.).  These document collections are accompanied by a set of queries and relevance judgments.
  • 78. IR test collections
  Collection   Number of documents   Number of queries
  Cranfield    1400                  225
  CACM         3204                  64
  CISI         1460                  112
  LISA         6004                  35
  TIME         423                   83
  ADI          82                    35
  MEDLINE      1033                  30
  TREC-1       742,611               100
  • 79. Fixed Recall Levels  One way to evaluate is to look at average precision at fixed recall levels.  This provides the information needed for precision/recall graphs.
  • 80. Document Cutoff Levels  Another way to evaluate: fix the number of documents retrieved at several levels (top 5, top 10, top 20, top 50, top 100), measure precision at each of these levels, and take a (weighted) average over the results.  This focuses on how well the system ranks the first k documents.
  • 81. Computing Recall/Precision Points  For a given query, produce the ranked list of retrievals.  Mark each document in the ranked list that is relevant according to the gold standard.  Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
  • 82. Computing Recall/Precision Points: An Example  Let the total number of relevant documents be 6. Ranked list (x = relevant): 1. 588 x, 2. 589 x, 3. 576, 4. 590 x, 5. 986, 6. 592 x, 7. 984, 8. 988, 9. 578, 10. 985, 11. 103, 12. 591, 13. 772 x, 14. 990.  Check each new recall point: R=1/6=0.167, P=1/1=1; R=2/6=0.333, P=2/2=1; R=3/6=0.5, P=3/4=0.75; R=4/6=0.667, P=4/6=0.667; R=5/6=0.833, P=5/13=0.38.  One relevant document is never retrieved, so 100% recall is never reached.
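A short sketch that reproduces the recall/precision points of this example from the ranked relevance judgments; the function name is our own.

```python
# Scan the ranked list and record (recall, precision) at each relevant hit.
def recall_precision_points(ranked_relevance, total_relevant):
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevance flags for the 14 retrieved documents in the example above
ranked = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
for r, p in recall_precision_points(ranked, total_relevant=6):
    print(f"R={r:.3f}  P={p:.3f}")
```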
  • 83. Interpolating a Recall/Precision Curve  Interpolate a precision value for each standard recall level: rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, i.e. r0 = 0.0, r1 = 0.1, ..., r10 = 1.0.  The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level greater than or equal to the j-th level.
  • 84. Example: Interpolated Precision  Precision at observed recall points: (0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5).  Interpolated precision at the standard recall levels: 0.0→1.0, 0.1→1.0, 0.2→1.0, 0.3→0.8, 0.4→0.8, 0.5→0.8, 0.6→0.6, 0.7→0.6, 0.8→0.6, 0.9→0.5, 1.0→0.5.  Interpolated average precision = 0.745.
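A small sketch of the interpolation rule (precision at standard level rj = maximum observed precision at any recall ≥ rj); it reproduces the interpolated values and the 0.745 average of this example.

```python
# Interpolated precision at the 11 standard recall levels.
observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]
levels = [i / 10 for i in range(11)]

interpolated = [max(p for r, p in observed if r >= level) for level in levels]
print([round(p, 2) for p in interpolated])
# [1.0, 1.0, 1.0, 0.8, 0.8, 0.8, 0.6, 0.6, 0.6, 0.5, 0.5]
print(round(sum(interpolated) / len(interpolated), 3))   # 0.745
```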
  • 86. Average Recall/Precision Curve  Compute average precision at each standard recall level across all queries.  Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
  • 87. Average Recall/Precision Curve  Models compared: 1. doc = “atn”, query = “ntc”; 2. doc = “atn”, query = “atc”; 3. doc = “atc”, query = “atc”; 4. doc = “atc”, query = “ntc”; 5. doc = “ntc”, query = “ntc”; 6. doc = “ltc”, query = “ltc”; 7. doc = “nnn”, query = “nnn”.
  • 88. Problems with Precision/Recall  True recall cannot be known, except in small collections.  Precision and recall are related; a combined measure is sometimes more appropriate.  They assume batch mode, whereas interactive IR is important and has different criteria for successful searches.  They assume a strict rank ordering matters.
  • 89. Other measures: R-Precision  R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents for a topic.  It de-emphasizes the exact ranking of the retrieved relevant documents.  The average is simply the mean R-Precision over the individual topics in the run.
  • 90. Other measures: F-measure  The F-measure takes into account both precision and recall; it is defined as the harmonic mean of recall and precision: F = 2PR / (P + R).
  • 91. Evaluation Problems  Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations.  Evaluating interactive IR requires human subjects (there is no gold standard or benchmark). [Ref.: see Borlund, 2000 & 2003; Borlund & Ingwersen, 1997 for IIR evaluation]
  • 92. Web IR: II  Tutorial on Web IR (contd.)
  • 93. Web Page: Basics  Web pages are written in a tagged markup language, the Hypertext Markup Language (HTML).  They contain hyperlinks: a hyperlink creates a link to another web page using a uniform resource locator (URL). A URL (http://www.icdim.org/program.asp) contains a protocol field (http), a server hostname (www.icdim.org) and a file path (/, the root of the published file system).  Pages are served through the Internet to client computers using the Hypertext Transfer Protocol (HTTP), which is built on top of TCP (Transmission Control Protocol).  They can be viewed using browsers.
  • 94. IR on the Web  Input: the publicly accessible Web.  Goal: retrieve high-quality pages that are relevant to the user's need, whether static (files: text, audio, ...) or dynamically generated on request (mostly database access).  Two aspects: 1. Processing and representing the collection (gathering the static pages, “learning” about the dynamic pages); 2. Processing queries (searching).
  • 95. How Web IR differs from classic IR  The Web is: huge, dynamic, self-organized, and hyperlinked.
  • 96. How Web IR differs from classic IR  1. Pages: Bulk: >1B (12/99).  Lack of stability: estimates of 23% changing per day, 38% per week.  Heterogeneity: type of documents (text, pictures, audio, scripts, ...), quality, language (100+ languages).  Duplication: syntactic (about 30% near-duplicates), semantic (unknown).  High linkage: ≥ 8 links/page on average.
  • 97. The big challenge  Meet the user needs given the heterogeneity of Web pages.
  • 98. How Web IR differs from classic IR  2. Users make poor queries: short (2.35 terms on average), imprecise terms, sub-optimal syntax (80% of queries contain no operator), low effort.
  • 99. How Web IR differs from classic IR  Wide variance in needs, knowledge and bandwidth.  Specific behavior: 85% look at one result screen only, 78% of queries are not modified, users follow links.
  • 100. The big challenge  Meet the user needs given the heterogeneity of Web pages and the poorly made queries.
  • 101. Web IR: the bright side  Many tools available.  Personalization.  Interactivity (refine the query if needed).
  • 102. Web IR tools  General-purpose search engines: direct (AltaVista, Excite, Google, Infoseek, Lycos, ...) and indirect, i.e. meta-search (MetaCrawler, AskJeeves, ...).  Hierarchical directories: Yahoo!, The Best of the Web (http://botw.org).
  • 103. [Screenshot: The Best of the Web directory]
  • 104. Web IR tools  Specialized search engines: home page finder (Ahoy), shopping robots (Jango, Junglee, ...), school and university search (searchedu.com), live images from everywhere (EarthCam, http://www.earthcam.com), The Search Engines Directory (http://www.searchengineguide.com/searchengines.html).  Search-by-example: Alexa's “What's related”, Excite's “More like this”.
  • 105. Search Engines' Components  1. Crawler (spider, robot, or bot): fetches web pages.  2. Indexer: processes and represents the data.  3. Search interface: answers queries.
  • 106. Crawler: Basic Principles  A crawler collects web pages by scanning already collected pages for hyperlinks to pages that have not been collected yet.  1. Start from a given set of URLs.  2. Fetch and scan them for new URLs (outlinks).  3. Fetch these pages in turn.  4. Repeat steps 2-3 for the new pages. [Ref.: Heydon & Najork, 1999; Brin & Page, 1998]
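A minimal, illustrative crawler loop following these principles, using only the Python standard library; a production crawler would also need robots.txt handling, politeness delays, URL canonicalization and robust error handling. The seed URL is a placeholder.

```python
# Breadth-first crawl: fetch a page, extract its outlinks, enqueue new URLs.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, frontier = {seed}, deque([seed])
    while frontier and len(seen) <= max_pages:   # stop once enough URLs found
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                             # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)        # resolve relative URLs
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

# print(crawl("http://example.com/"))   # placeholder seed
```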
  • 107. Indexing the web  In traditional IR, documents are static and non-linked.  Web documents are dynamic: documents are added, modified or deleted, while the mappings from terms to documents and positions are not constructed incrementally (Slide 19). How should the index be updated?  Web documents are linked: they contain hyperlinks. How should documents be ranked?
  • 108. Web pages are dynamic: some figures  40% of all webpages in their dataset changed within a week, and 23% of the .com pages changed daily. [Junghoo Cho and Hector Garcia-Molina, 2000]
  • 109. Updating index: a solution  A static main index is used for answering queries.  A signed (d, t) record, written (d, t, s), is maintained for documents being added, deleted or modified, where s is a bit specifying whether the document has been deleted or inserted.  A stop-press index is created from these (d, t, s) records.  A query is sent to both the main index and the stop-press index: the main index returns a set of documents D0; the stop-press index returns two sets, D+ and D-, where D+ is the set of matching documents not yet indexed and D- is the set of matching documents that have been removed from the collection since D0 was constructed.
  • 110. Updating index  The retrieved set is constructed as (D0 ∪ D+) \ D-.  When the stop-press index gets too large, the signed (d, t, s) records are sorted in (d, t, s) order and merge-purged into the master (t, d) records.  The master index is then rebuilt and the stop-press index emptied.
  • 111. Index Compression Techniques  A significant portion of index space is used by document IDs.  Delta encoding is used to save space: sort document IDs in increasing order, store the first ID in full, and store only gaps (i.e. the difference from the previous ID) for subsequent entries.
  • 112. Delta Encoding: Example  Suppose the word harmony appears in documents 50, 75, 110, 120, 125, 170 and 200. The record for harmony is then the vector (50, 25, 35, 10, 5, 45, 30).
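A small sketch of delta (gap) encoding and decoding of a postings list; it reproduces the vector for harmony above.

```python
# Gap-encode a postings list: first document ID in full, then only the gaps.
def delta_encode(doc_ids):
    gaps, prev = [], 0
    for doc_id in sorted(doc_ids):
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [50, 75, 110, 120, 125, 170, 200]
print(delta_encode(postings))                              # [50, 25, 35, 10, 5, 45, 30]
print(delta_decode(delta_encode(postings)) == postings)    # True
```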
  • 113. Other Issues  Spamming: adding popular terms to pages unrelated to those terms.  Titles, headings, metatags: search engines give additional weight to terms occurring in titles, headings, font modifiers and metatags.  Approximate string matching: Soundex, n-grams (Google suggests variant spellings of query terms based on the query log using n-grams).  Metasearch engines: send the query to several search engines at once and return the results from all of them in one long unified list.
  • 115. [Chart: shares of searches, July 2006]
  • 116. Link Analysis  Link analysis exploits the additional information inherent in the hyperlink structure of the Web to improve the quality of search results.  Link analysis or ranking algorithms underlie several of today's most popular and successful search engines.  All major search engines combine link analysis scores with more traditional information retrieval scores to retrieve web pages in response to a query.  PageRank and HITS are two algorithms for ranking web pages based on links.
  • 117. Google's Search behaviour: a few points  Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, term order and the proximity of the search terms to one another on the page.  Words in a special typeface (bold, underlined or all capitals) get extra credit.  Google can also match multi-word phrases and sentences.  Google does not disclose everything that goes into its search index, but the cornerstone algorithm, called PageRank, is well known.  For more information on Google's technology, visit www.google.com/technology.
  • 118. Google's Search behaviour  Google returns only pages that match all your search terms. A search for [ cricket match ] finds pages containing both the words "cricket" and "match".
  • 119. Google's Search behaviour  Google returns pages that match your search terms exactly. If you search for “tv” Google won't find “television”; if you search for “cheap” Google won't find “inexpensive”.
  • 120. Google's Search behaviour  Google returns pages that match variants of your search terms. The query [ child lock ] finds pages that contain words similar to some or all of your search terms, e.g. "child", "children" or "children's", "locks" or "locking".
  • 121. Google's Search behaviour  Google ignores stop words.  Google favors results that have your search terms near each other (it considers the proximity of the search terms within a page).
  • 122. Google's Search behaviour  Google gives higher priority to pages that have the terms in the same order as in your query.  Google is NOT case sensitive; it treats all search terms as lowercase.  Google ignores some punctuation and special characters, including , . ; ? [ ] ( ) @ / # ; [ Dr. Tanveer ] returns the same results as [ Dr Tanveer ].
  • 123. Google's Search behaviour  A term with an apostrophe (single quote) doesn't match the term without an apostrophe.  Google searches for variations on any hyphenated terms: [ e-mail ] matches "e-mail", "email" and "e mail".
  • 125. Google's PageRank  PageRank is a system for ranking web pages, developed by the Google founders Larry Page and Sergey Brin at Stanford University.  PageRank determines the quality of a Web page by the pages that link to it.  Once a month, Google's spiders crawl the Web to update and enhance the Google index.  PageRank is a numeric value that represents how important a page is on the web.
  • 126. Google's PageRank  PageRank looks at a Web page and determines how many other pages link to it (a measure of popularity), and then analyzes the links to those pages.  When one page links to another page, it is effectively casting a vote for the other page.  The more votes cast for a page, the more important the page must be.  Google calculates a page's importance from the votes cast for it.
  • 127. Google's PageRank  The PageRank of page A is: PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages linking to page A, d is a damping factor (usually set to 0.85), PR(T) is the PageRank of page T, C(T) is the number of links going out of page T, and PR(T)/C(T) is the PageRank of page T divided by the number of its outgoing links.
  • 128. Google's PageRank  More simply, a page's PageRank = 0.15 + 0.85 × (a "share" of the PageRank of every page that links to it), where a "share" is the linking page's PageRank divided by the number of outgoing links on that page.  A page "votes" an amount of PageRank onto each page that it links to.
  • 129. Google's PageRank  The PageRank of a page that links to yours is important, but the number of links on that page also matters: the more links there are on a page, the less PageRank value your page receives from it.
  • 130. Google's PageRank  The PageRank algorithm is applied by first guessing a PageRank for all the indexed pages and then iterating recursively until the PageRank values converge.  In effect, PageRank says that pages "vote" for other pages on the Internet: if page A links to page B (i.e. A votes for B), it is saying that B is an important page.  If lots of pages link to a page, it has more votes and its worth should be higher.
  • 131. Google's PageRank: Example  Consider a web site with 3 pages (A, B and C), no links coming in from the outside, and an initial PageRank of 1.  With no links between the pages, after one iteration: PR(A) = (1-d) + d(...) = 0.15 and PR(B) = PR(C) = 0.15.  After 10 iterations: PR(A) = PR(B) = PR(C) = 0.15.
  • 132. Example  Now we link page A to page B and run the calculations for each page. We end up with: Page A = 0.15, Page B = 1, Page C = 0.15.  After the second iteration the figures are: Page A = 0.15, Page B = 0.2775, Page C = 0.15.
  • 133. Example  Now we link all pages to all pages and repeat the calculations with an initial PageRank of 1. We get: Page A = 1, Page B = 1, Page C = 1.
  • 134. Example  Now remove the links between page B and page C. After 1 iteration the results are: PR(A) = 0.15 + 0.85(1 + 1) = 1.85, PR(B) = 0.15 + 0.85(1/2) = 0.575, PR(C) = 0.575.  After the second iteration: PR(A) = 1.1275, PR(B) = 0.93625, PR(C) = 0.93625.  After the third iteration: PR(A) = 1.741625, PR(B) = 0.629, PR(C) = 0.629 ...
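A small sketch of the PageRank iteration with synchronous updates, PR(A) = (1 - d) + d · Σ PR(T)/C(T). Run for one iteration on the example graph above (A links to B and C; B and C link back to A), it reproduces the slide's first-iteration values.

```python
# Iterative PageRank with synchronous (Jacobi-style) updates.
def pagerank(links, d=0.85, iterations=10):
    pr = {page: 1.0 for page in links}            # initial PageRank of 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = [p for p, outs in links.items() if page in outs]
            new_pr[page] = (1 - d) + d * sum(pr[p] / len(links[p])
                                             for p in incoming)
        pr = new_pr
    return pr

# Example graph with the B-C links removed: A -> B, A -> C, B -> A, C -> A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
print({p: round(v, 3) for p, v in pagerank(links, iterations=1).items()})
# {'A': 1.85, 'B': 0.575, 'C': 0.575}
```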
  • 135. Calculate PageRank values for the following link graph of pages A, B and C. [Diagram not reproduced]
  • 136. Semantic Web?  "Semantic" stands for "the meaning of"; the semantics of something is its meaning.  The Semantic Web is about describing things in a way that computer applications can understand.
  • 137. Semantic Web  The Semantic Web is not about links between web pages.  It describes the relationships between things (like A is a part of B, and Y is a member of Z) and the properties of things (like size, weight, age, etc.).
  • 138. Semantic Web  Semantic Web = a Web with a meaning. "If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database" (Tim Berners-Lee, Weaving the Web, 1999).  The Semantic Web uses RDF to describe web resources.
  • 139. RDF: The Resource Description Framework  RDF (Resource Description Framework) is a markup language for describing information and resources on the web.  Putting information into RDF files makes it possible for computer programs ("web spiders") to search, discover, pick up, collect, analyze and process information from the web.
  • 140. References  Soumen Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data". Elsevier, 2003.  B. J. Jansen, A. Spink and T. Saracevic, "Real life, real users and real needs: A study and analysis of user queries on the Web". Information Processing & Management, 36(2), 207-227, 2000.  G. Salton, E. A. Fox and H. Wu, "Extended Boolean information retrieval". Communications of the ACM, 26(11), 1022-1036, 1983.  G. Salton and M. J. McGill, "Introduction to Modern Information Retrieval". New York: McGraw-Hill, 1983.  C. J. van Rijsbergen, "Information Retrieval". 2nd ed., Butterworths, London, 1979.  E. A. Fox, S. Betrabet, M. Kaushik and W. Lee, "Extended Boolean model". In W. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms. Prentice Hall, pp. 393-418, 1992.  H. P. Luhn, "The automatic creation of literature abstracts". IBM Journal of Research and Development, 2(2), 1958.  C. D. Paice, "Soft evaluation of Boolean search queries in information retrieval systems". Information Technology: Research and Development, 3(1), 33-42, 1984.  M. F. Porter, "An algorithm for suffix stripping". Program, 14(3), 130-137, 1980.  A. Heydon and M. Najork, "A scalable, extensible web crawler". World Wide Web, 2(4), 219-229, 1999.  S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine". In Proceedings of the 7th World Wide Web Conference, 1998. decweb.ethz.ch/WWW7/1921/com1921.htm.
  • 141. References  P. Borlund and P. Ingwersen, "The development of a method for the evaluation of interactive information retrieval systems". Journal of Documentation, 53, 225-250, 1997.  P. Borlund, "Experimental components for the evaluation of interactive information retrieval systems". Journal of Documentation, 56(1), 2000.  P. Borlund, "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems". Information Research, 8(3), 2003.  Anastasios Tombros and C. J. van Rijsbergen, "Query-sensitive similarity measures for information retrieval". Knowledge and Information Systems, 6, 617-642, 2004.  Tanveer J. Siddiqui and Uma Shanker Tiwary, "Integrating relation and keyword matching in information retrieval". In Rajiv Khosla, Robert J. Howlett, Lakhmi C. Jain (Eds.), Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Data Mining and Soft Computing Applications-II, Lecture Notes in Computer Science, vol. 3684, pages 64-71, Melbourne, Australia, September 2005. Springer Verlag.  Siddiqui, Tanveer J. and Tiwary, U. S., "A hybrid model to improve relevance in document retrieval". Journal of Digital Information Management, 4(1), 73-81, 2006.
  • 142. Resources  Text books: Salton; van Rijsbergen; Sandor Dominich; Frakes & Baeza-Yates.  Suggested readings: Robertson, Sparck-Jones, Voorhees, Myaeng, Liddy, Khoo, I. Ounis, Gelbukh, A. F. Smeaton, T. Strzalkowski, B. J. Jansen, A. Spink, Padmini Srinivasan, M. Mitra, A. Singhal, T. Saracevic, D. Harman, P. Borlund (evaluation/relevance).
  • 143. Resources  Journals: Information Retrieval; Information Processing & Management; JASIS (Journal of the American Society for Information Science); TOIS (ACM Transactions on Information Systems); Information Research (online); Journal of Digital Information Management; KAIS (Springer).  Proceedings of the SIGIR/TREC conferences.