Machine Learning & Support Vector Machines
                 Lecture 9
             Sean A. Golliher
Let a, b be two events.

  p(a | b) p(b) = p(a ∩ b) = p(b | a) p(a)

  p(a | b) = p(b | a) p(a) / p(b)

  p(a | b) p(b) = p(b | a) p(a)
Let D be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed)
query and let NR represent non-relevance.

Need to find p(R|D) - probability that a retrieved document D
is relevant.
  p(R | D) = p(D | R) p(R) / p(D)
  p(NR | D) = p(D | NR) p(NR) / p(D)

  p(R), p(NR) - prior probability of retrieving a (non-)relevant document
  p(D | R), p(D | NR) - probability that if a relevant (non-relevant) document is
  retrieved, it is D.
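A tiny numeric sketch of how Bayes' rule turns these quantities into p(R|D). The priors and likelihoods below are made-up illustration values, not from the lecture:

# Hypothetical sketch: Bayes' rule applied to document relevance.
# All numbers are illustrative assumptions.
p_R, p_NR = 0.01, 0.99            # prior probability of (non-)relevance
p_D_given_R = 0.002               # probability of drawing document D from the relevant set
p_D_given_NR = 0.00001            # probability of drawing D from the non-relevant set

p_D = p_D_given_R * p_R + p_D_given_NR * p_NR   # total probability of D
p_R_given_D = p_D_given_R * p_R / p_D           # Bayes' rule
print(round(p_R_given_D, 2))                    # ~0.67: D is more likely relevant than not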
   Suppose we have a vector representing the presence and
    absence of terms (1,0,0,1,1). Terms 1, 4, & 5 are present.
   What is the probability of this document occurring in the
    relevant set?
   pi is the probability that term i occurs in the relevant set.
     (1 - pi) would be the probability that the term is not included
     in the relevant set.
   This gives us: p1 x (1-p2) x (1-p3) x p4 x p5
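A minimal Python sketch of this product, assuming hypothetical values for the term probabilities pi (the lecture gives no numbers):

# Binary independence model: probability of the presence/absence vector
# (1,0,0,1,1) given the relevant set. The p values are illustrative only.
p = [0.8, 0.3, 0.2, 0.6, 0.5]     # p_i = P(term i present | relevant)
doc = [1, 0, 0, 1, 1]             # terms 1, 4 and 5 are present

prob = 1.0
for p_i, present in zip(p, doc):
    prob *= p_i if present else (1.0 - p_i)   # p_i for present terms, (1 - p_i) for absent ones
print(prob)   # 0.8 * 0.7 * 0.8 * 0.6 * 0.5 = 0.1344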
   BM25: popular and effective ranking algorithm
    based on the binary independence model
      adds document and query term weights



     k1, k2 and K are parameters whose values are set
        empirically
                                   dl is doc length
     Typical TREC value for k1 is 1.2, k2 varies from 0
      to 1000, b = 0.75
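The formula these parameters belong to is not reproduced in this text (it was an image on the slide); for reference, the standard BM25 ranking function that uses exactly these symbols is

\[
\mathrm{BM25}(Q,D) = \sum_{i \in Q}
\log\!\frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}
\cdot \frac{(k_1 + 1)\,f_i}{K + f_i}
\cdot \frac{(k_2 + 1)\,qf_i}{k_2 + qf_i},
\qquad
K = k_1\bigl((1 - b) + b \cdot dl/avdl\bigr)
\]

where fi is the frequency of term i in the document, qfi its frequency in the query, ni the number of documents containing term i, N the collection size, and ri, R the relevance counts (zero when no relevance information is available).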
   Query with two terms, “president lincoln” (qfi = 1, where qfi is
    the frequency of term i in the query)
   No relevance information (r and R are zero)
   N = 500,000 documents
   “president” occurs in 40,000 documents (n1 = 40, 000)
   “lincoln” occurs in 300 documents (n2 = 300)
   “president” occurs 15 times in doc (f1 = 15)
   “lincoln” occurs 25 times (f2 = 25)
   document length is 90% of the average length (dl/avdl
    = .9)
   k1 = 1.2, b = 0.75, and k2 = 100
   K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
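A Python sketch of this worked example under the standard BM25 form given earlier (a reconstruction for illustration; the slide's own calculation is an image):

# BM25 score for the two-term query "president lincoln" with the slide's
# numbers. No relevance information, so r_i = R = 0.
import math

N = 500_000                               # documents in the collection
k1, b, k2 = 1.2, 0.75, 100
K = k1 * ((1 - b) + b * 0.9)              # 1.2 * (0.25 + 0.75 * 0.9) = 1.11

def bm25_term(n_i, f_i, qf_i=1):
    idf = math.log((N - n_i + 0.5) / (n_i + 0.5))    # r_i = R = 0
    tf_doc = (k1 + 1) * f_i / (K + f_i)
    tf_query = (k2 + 1) * qf_i / (k2 + qf_i)
    return idf * tf_doc * tf_query

score = bm25_term(40_000, 15) + bm25_term(300, 25)
print(round(score, 2))   # roughly 5.0 ("president") + 15.6 ("lincoln") ≈ 20.6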
   Unigram language model (simplest form)
     probability distribution over the words in a
      language
     generation of text consists of pulling words out of
      a “bucket” according to the probability distribution
      and replacing them
   N-gram language model
     some applications use bigram and trigram
      language models where probabilities depend on
      previous words
     Based on previous n-1 words
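A small sketch of the “bucket” picture: estimate a unigram distribution from a toy text and sample words from it with replacement (illustrative only):

import collections
import random

text = "to be or not to be that is the question".split()

# Unigram language model: P(w) = count(w) / total number of words
counts = collections.Counter(text)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

# "Pulling words out of a bucket" according to P(w), with replacement
words, probs = zip(*unigram.items())
print(random.choices(words, weights=probs, k=5))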
   A topic in a document or query can be
    represented as a language model
     i.e., words that tend to occur often when
     discussing a topic will have high probabilities in
     the corresponding language model
   Rank documents by the probability that the query
    could be generated by the document language
    model (i.e. same topic) P(Q|D)
   Assuming uniform, unigram model
   Obvious estimate for unigram probabilities is
      P(qi | D) = fqi,D / |D|
      fqi,D is the number of times word qi occurs in the document;
       |D| is the number of words in the document
     If query words are missing from document, score
      will be zero
     Missing 1 out of 4 query words same as missing 3
      out of 4. Not good for long queries!
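A sketch of query-likelihood scoring with these maximum-likelihood estimates, showing how a single missing query word zeroes the score (the document and queries are made up):

# Query likelihood with unsmoothed maximum-likelihood estimates:
# P(Q|D) = product over query words of f_{qi,D} / |D|
import collections

def query_likelihood(query, doc_words):
    counts = collections.Counter(doc_words)
    score = 1.0
    for q in query:
        score *= counts[q] / len(doc_words)   # zero if q never occurs in the document
    return score

doc = "president lincoln spoke to congress about the war".split()
print(query_likelihood(["president", "lincoln"], doc))                 # > 0
print(query_likelihood(["president", "lincoln", "gettysburg"], doc))   # 0.0: one missing word kills the score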
   Document texts are a sample from the
    language model
     Missing words should not have zero probability of
     occurring (calculating probability query could be
     generated from document)
   Smoothing is a technique for estimating
    probabilities for missing (or unseen) words
     lower (or discount) the probability estimates for
      words that are seen in the document text
     assign that “left-over” probability to the estimates
      for the words that are not seen in the text
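One standard way to do this is linear interpolation with a collection (background) language model, i.e. Jelinek-Mercer smoothing; the slides do not name the method, so this is only an illustrative sketch:

# Jelinek-Mercer smoothing: mix the document model with a background
# collection model so unseen words keep a non-zero probability.
#   P(q|D) = (1 - lam) * f_{q,D} / |D|  +  lam * c_q / |C|
import collections

def smoothed_prob(word, doc_words, collection_words, lam=0.5):
    p_doc = collections.Counter(doc_words)[word] / len(doc_words)
    p_coll = collections.Counter(collection_words)[word] / len(collection_words)
    return (1 - lam) * p_doc + lam * p_coll

doc = "president lincoln spoke to congress".split()
collection = "the president addressed congress the union army fought at gettysburg".split()
print(smoothed_prob("gettysburg", doc, collection))   # non-zero even though the word is absent from doc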
   Informational
     Finding information about some topic which may be on one or
       more web pages
     Topical search
   Navigational
     finding a particular web page that the user has either seen before
       or is assumed to exist
   Transactional
     finding a site where a task such as shopping or downloading
       music can be performed

    Broder (2002) http://www.sigir.org/forum/F2002/broder.pdf
 For effective navigational and transactional
  search, need to combine features that reflect
  user relevance
 Commercial web search engines combine
  evidence from hundreds of features to
  generate a ranking score for a web page
     page content, page metadata, anchor text, links
      (e.g., PageRank), and user behavior (click logs)
     page metadata – e.g., “age”, how often it is
      updated, the URL of the page, the domain name
      of its site, and the amount of text content
   SEO: understanding the relative importance
    of features used in search and how they can
    be optimized to obtain better search rankings
    for a web page
     e.g., improve the text used in the title tag, improve
      the text in heading tags, make sure that the
      domain name and URL contain important
      keywords, and try to improve the anchor text and
      link structure
     Some of these techniques are regarded as not
      appropriate by search engine companies
   Galago: a toolkit, written in Java, for experimenting with text.

   http://www.galagosearch.org/quick-start.html
   Considerable interaction between these
    fields
      Arthur Samuel, 1959 – checkers game: the world’s
       first self-learning program, run on an IBM 701.
   Web query logs have generated new wave of
    research
     e.g., “Learning to Rank”
   Supervised Learning
     Regression analysis
   Classification Problems
     Support Vector Machines (SVM)
   Unsupervised Learning
     http://www.youtube.com/watch?v=GWWIn29ZV4Q
 Reinforcement Learning
 Learning Theory
     How much training data do we need?
      How accurately can we predict an event (e.g., with 99%
       accuracy)?
 Papers: Boser et al., 1992
 Standard SVM [Cortes and Vapnik, 1995]
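The SVM slides themselves are images, but the speaker notes below describe finding a separating hyperplane w^T x + b = 0 and classifying by which side a point falls on. A minimal sketch of that decision rule using scikit-learn's linear SVC on toy data (illustrative, not the lecture's example):

# Linear SVM: learn a separating hyperplane w^T x + b = 0 and classify
# new points by the sign of w^T x + b. The data below is made up.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 2.5],          # one class ("relevant")
              [-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.0]])   # other class ("non-relevant")
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # C penalizes training errors (slack), as in note #27
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([1.0, 1.0])
print(int(np.sign(w @ x_new + b)))  # +1: predicted on the "relevant" side

Maximizing the margin between the two classes is what the fitted SVC optimizes; C trades margin width against training errors (Cortes and Vapnik, 1995).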

Editor's Notes

  • #6: di = 1: the product is over the terms that have value 1. For example, in the index, if the phrase appeared in the document it would have a one. si = the denominator, P(D|NR).
  • #7: http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf … BM25 stands for Best Match; developed in the 1980s. K normalizes by document length. b regulates the impact of the length normalization; b = 0.75 was found to be effective.
  • #8: Summation over all terms in the query. Scoring a single document in the collection to see how it matches a query.
  • #10: Language models are used in speech recognition, machine learning, etc.
  • #12: di = 1: the product is over the terms that have value 1. For example, in the index, if the phrase appeared in the document it would have a one. qi is a query word and there are n words in the query.
  • #13: For example, if we have a language model representing a document about computer games, the document should have a non-zero probability for the word RPG (role-playing game) even if the word does not appear in the document. The question is how much weight to give a document if it has ALL the query words. Is it really MORE relevant just because every word appeared in the document?
  • #15: Taxonomy – Identifying and classifying things into groups or classes.
  • #18: di = 1: the product is over the terms that have value 1. For example, in the index, if the phrase appeared in the document it would have a one. qi is a query word and there are n words in the query.
  • #23: In this case we can use density and frequency…
  • #24: Trying to maximize the width of the tube (the margin). If a point is on the right it is relevant; if it is on the left it is not. Then we define a decision function. How do we find the optimum? If we use the dotted line as our model we just check whether data is on the right or left hand side. Find a separating hyperplane. We are going to train this function until we get a good predictive model. Finding the general hyperplane w^T x + b = 0. Once we find w and b we can make predictions: a sample xi should be classified as relevant if w^T xi + b > 0. Will combine the two inequalities next.
  • #25: Distance between two parallel lines is given by.
  • #27: The subtraction of epsilon guarantees a separation in the data. C is a term for training errors.