Using the search engine as recommendation engine

Recommendations from the search engine
Sesam Hackathon, Warsaw, 2014-03-23
Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga

This whole presentation is about Ted Dunning’s proposed approach to recommendations
Based on his 1993 paper
– references at the end
Very simple method, dead easy to implement
– seems to work pretty well
Inspiration

Recommenders are usually designed to predict ratings
– Dunning believes this is the wrong approach
– people’s ratings don’t necessarily reflect what they’ll buy
– go by what people do rather than what they say
You don’t want to recommend Bob Dylan
– everyone has already heard of him and knows what they think
– you want to recommend things that are new to the user
You don’t want to recommend things everyone likes
Thoughts on recommendations

Step 1
– work out which things tend to occur together
– that is, if you buy this, you’re likely to also buy this
– however, we only want pairs which are statistically
significant
Step 2
– index up the significant pairs in a search engine
– use search to produce the actual results
The actual approach

Statistically significant co-occurrence
Part the first

User Item
u1 i1
u1 i2
u2 i1
u3 i2
u3 i3
u3 i4
... ...
The starting point
Some kind of log of user actions
User has
– bought a movie | album | book | ...
– opened a document
– ...
From this raw material, we can work
out what things tend to go together
– and whether this is significant
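As a rough illustration of the step above, the sketch below counts how many users each pair of items has in common. Counting one co-occurrence per user per pair is an assumption; the slides do not say exactly how the counts are made, and `cooccurrence_counts` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(log):
    """Count, for each unordered item pair, how many users touched both items.

    `log` is an iterable of (user, item) tuples, like the table above.
    """
    items_by_user = defaultdict(set)
    for user, item in log:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        # sorted() gives each pair one canonical (a, b) ordering
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)


# The sample rows from the log above
log = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
       ("u3", "i2"), ("u3", "i3"), ("u3", "i4")]
counts = cooccurrence_counts(log)
```

The resulting dictionary is a sparse form of the item-to-item matrix: pairs that never occur together simply have no entry.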

      i1   i2   i3   i4   i5   i6   i7
i1     -   23   42    0    0    5    7
i2    23    -    6    1  129    2   10
i3    42    6    -    3    0  492    1
i4     0    1    3    -    2    3    1
i5     0  129    0    2    -   94    2
i6     5    2  492    3   94    -    1
i7     7   10    1    1    2    1    -
Item-to-item matrix

k[0][0] = the number in the matrix on the previous slide
k[0][1] = the sum of that whole column minus k[0][0]
k[1][0] = the sum of that whole row minus k[0][0]
k[1][1] = the sum of the entire matrix minus k[0][0] minus k[1][0] minus k[0][1]
Producing the k 2x2 matrix
How to compute the k matrix for a given cell in the matrix
on the previous slide
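The four definitions of k can be sketched directly in Python. This is only an illustration: `contingency` is a hypothetical helper name, and treating the empty diagonal of the item-to-item matrix as 0 is an assumption.

```python
def contingency(matrix, row, col):
    """Build the 2x2 matrix k for the cell at (row, col), following the slide."""
    k00 = matrix[row][col]
    k01 = sum(r[col] for r in matrix) - k00          # rest of that column
    k10 = sum(matrix[row]) - k00                     # rest of that row
    k11 = sum(sum(r) for r in matrix) - k00 - k10 - k01
    return [[k00, k01], [k10, k11]]


# The item-to-item matrix from the earlier slide, with the diagonal set to 0
matrix = [
    [0, 23, 42, 0, 0, 5, 7],
    [23, 0, 6, 1, 129, 2, 10],
    [42, 6, 0, 3, 0, 492, 1],
    [0, 1, 3, 0, 2, 3, 1],
    [0, 129, 0, 2, 0, 94, 2],
    [5, 2, 492, 3, 94, 0, 1],
    [7, 10, 1, 1, 2, 1, 0],
]

k = contingency(matrix, 2, 5)  # the i3/i6 cell, which holds 492
# k == [[492, 105], [52, 999]]
```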
If the output of LLR(k) is above some threshold, the pair is considered significant.
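For reference, here is a minimal sketch of the log-likelihood ratio itself, written in the entropy form that is commonly used for Dunning's test. It is not taken from llr.py; the function names are made up for this example, and the threshold is left as a parameter because the slides do not give a value.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N*ln(N) - sum(k*ln(k)), where N = sum of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k):
    """Log-likelihood ratio for a 2x2 contingency matrix k."""
    row_entropy = entropy(k[0][0] + k[0][1], k[1][0] + k[1][1])
    col_entropy = entropy(k[0][0] + k[1][0], k[0][1] + k[1][1])
    mat_entropy = entropy(k[0][0], k[0][1], k[1][0], k[1][1])
    # Clamp at zero to hide tiny negative values caused by rounding
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))


def is_significant(k, threshold):
    """True if the pair scores above the caller-chosen (hypothetical) threshold."""
    return llr(k) > threshold
```

A matrix whose rows are proportional, such as [[1, 1], [1, 1]], scores 0 because the two items occur together no more often than chance predicts. Strongly associated pairs score high.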

Check the Python code on
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
– this requires a lot of memory and CPU
Or just use Mahout
– RowSimilarityJob does exactly this
Doing it for real

Search engine as recommender
Part the second

Take all the items and index them up with the
search engine in the usual way
– that is, each item has an id, a title, a description, etc.
Then, add a “magic” field
– put into it the IDs of all the items that appear in a
significant pair with this item
– let’s call this field “indicators”
Now we’re ready to do recommendations
Indexing with the search engine
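To make the “indicators” field concrete, the sketch below builds one document per item as a plain dictionary. It is independent of any particular search engine. The helper name, the field layout and the sample titles are assumptions for illustration, and every ID in a pair is assumed to be present in `items`.

```python
def build_documents(items, significant_pairs):
    """Attach to each item the IDs of the items it forms a significant pair with.

    `items` maps item ID -> title; `significant_pairs` is a list of (id, id).
    """
    indicators = {item_id: set() for item_id in items}
    for a, b in significant_pairs:
        # A significant pair is symmetric, so record it on both items
        indicators[a].add(b)
        indicators[b].add(a)
    return [
        {"id": item_id, "title": title, "indicators": sorted(indicators[item_id])}
        for item_id, title in sorted(items.items())
    ]


# Hypothetical sample data
items = {"m1": "Movie one", "m2": "Movie two", "m3": "Movie three"}
pairs = [("m1", "m2"), ("m1", "m3")]
docs = build_documents(items, pairs)
```

Each dictionary would then be handed to the search engine as one document, with “indicators” indexed as a multi-valued field of IDs.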

Collect some set of items for which the user has
expressed a preference
– by buying them, looking at them, rating them, whatever
The IDs of these items are your query
– search the “indicators” field
– the search results are your recommendations
That’s it!
– pack up, go home
Doing recommendations
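The query step can be mimicked in memory to see what the search engine is being asked to do. This toy version ranks documents by how many liked IDs appear in their “indicators” field. A real engine would use its own relevance scoring instead of this plain overlap count, and `recommend` is a hypothetical name.

```python
def recommend(documents, liked_ids, limit=10):
    """Return IDs of documents whose indicators overlap the liked items."""
    liked = set(liked_ids)
    scored = []
    for doc in documents:
        hits = len(liked.intersection(doc["indicators"]))
        if hits:
            scored.append((hits, doc["id"]))
    # Most matching indicators first; ties broken by ID for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


# Hypothetical documents; only the fields used by the query are shown
documents = [
    {"id": "m1", "indicators": ["m2", "m3"]},
    {"id": "m2", "indicators": ["m1", "m4"]},
    {"id": "m3", "indicators": ["m1", "m4"]},
    {"id": "m4", "indicators": ["m2", "m3"]},
]
results = recommend(documents, ["m2", "m3"])
```

Note that this sketch does not filter out items the user has already liked; whether to do so is a separate design choice.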

Imagine that you’re searching for movies, and you
type “the godfather”
– “the” appears in all documents, so documents matching that
get a low relevance score
– “godfather” appears in very few documents, so matches on
that get a high score
– this is basically TF/IDF in a nutshell
Now, imagine you liked two movies: “The Godfather”
and “The Daytrippers”
– nearly all movies have “The Godfather” as an indicator
– very few have “The Daytrippers”
– the second will therefore influence recommendations much
more
Why does it work?
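The effect can be shown with a textbook inverse document frequency, idf = ln(N / df). This is a simplified formula, not the exact one used by Lucene or any other engine, and the IDs below are made up.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Weight each indicator ID by ln(N / df): rare indicators get high weights."""
    n_docs = len(documents)
    doc_freq = Counter(
        ind for doc in documents for ind in set(doc["indicators"])
    )
    return {ind: math.log(n_docs / df) for ind, df in doc_freq.items()}


# Hypothetical index: every movie has "godfather" as an indicator,
# but only one has "daytrippers"
documents = [
    {"id": "a", "indicators": ["godfather", "daytrippers"]},
    {"id": "b", "indicators": ["godfather"]},
    {"id": "c", "indicators": ["godfather"]},
    {"id": "d", "indicators": ["godfather"]},
]
weights = idf_weights(documents)
```

Here “godfather” gets weight 0 because it matches every document, while “daytrippers” gets ln(4). A match on the rare indicator therefore moves a movie much further up the results.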

Trying it out for real
Part the third

Again, the code is on Github
– very simple webapp based on web.py and Lucene
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
The underlying data is the MovieLens dataset
– 10 million ratings of 10,000 movies by 72,000 users
– http://grouplens.org/datasets/movielens/
Real demo with real data

llr.py
– this chews the data, producing the significant pairs
– takes a huge amount of memory and about 30 minutes
– I have made absolutely no attempt to optimize it
llr_index.py
– reads output of previous script, makes Lucene index
recom-ui.py
– the actual web application
Three scripts

Liked two movies
Movies with the highest LLR score together with this movie

Liked three movies
Recommendations are actually now spot-on. At least for me.

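# nocache, search, session and render are assumed to be defined
# elsewhere in the web application.
class Movie:
    def GET(self, movieid):
        nocache()
        # Look up the movie itself by its ID
        doc = search.do_query('id', movieid)[0]
        #recoms = search.do_query('indicators', movieid)
        # Fetch each movie listed in this document's "bets" field
        # (presumably the movies with the highest LLR scores for this one)
        recoms = [search.do_query('id', betid)[0] for betid in doc.bets]
        # Personal recommendations: query "indicators" with all liked IDs
        if hasattr(session, 'liked'):
            youlike = search.do_query('indicators', session.liked)
        else:
            youlike = []
        return render.movie(doc, recoms, youlike)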
Complete code for movie page

Tweak the parameters a bit to see what happens
Can we support a “Dislike” button?
Test it with more kinds of data
Learn how to do this with Mahout
Things left to do

What is this?
From Ted Dunning’s slides

And this?

And this?

The original 1993 paper
– http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
Ebook with lots of background but little detail
– http://www.mapr.com/practical-machine-learning
Slides covering the same material
– www.slideshare.net/tdunning/building-multimodal-recommendation-engines-using-search-engines
Blog post with actual equations
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
References
