Collaborative Filtering

Collaborative Filtering Tayfun Şen 18 December 2006 You can reach the author at: stayfun{at}metu.edu.tr

What is the problem? In a nutshell: Life is too short! We don't have time to watch all the movies, listen to all the music, read every book etc...

Overwhelming quantity of information on the web We all ask our friends for recommendations. We read newspapers and web sites and watch TV to form our own opinions. We want to be sure that the activity we spend our time on is worthwhile. We take into consideration the recommendations made by people we trust.

Time Person of the Year 2006: You? Yes, you. You control the Information Age. Welcome to your world.

From Time (25 Dec. 2006 edition) “It's a story about community and collaboration on a scale never seen before. It's about the cosmic compendium of knowledge Wikipedia and the million-channel people's network YouTube and the online metropolis MySpace. It's about the many wresting power from the few and helping one another for nothing and how that will not only change the world, but also change the way the world changes.”

Futurism Semantic Web? In his seminal article published in Scientific American [1], the creator of the WWW, Tim Berners-Lee, talks about the semantic web. Adding meaning to the Internet looks like a groundbreaking idea, but when will it be implemented? It needs standard ontologies, mappings between them, and some sort of acceptance by the web community. 10-15 years needed maybe? Collaborative filtering saves the day.

Implications of the Recommendation in the Internet There are basically two types of filtering techniques in use on the Internet today: content based filtering and collaborative filtering.

Examples on the Internet Netflix, Amazon, Pandora.com, Last.fm ... It is natural for Web 2.0 too! Digg, flickr, stumbleupon etc. All these websites rely on their users' interaction to generate content relevant to every user. That's what Web 2.0 means. User interaction.

Content based algorithms These rely on the implicit data of the domain. For example, in a movie recommendation site, this could be the director information, movie length, PG rating, cast etc. For song recommendation this could be the song date, other albums/songs from the same group, or the type of the song (jazz, classical, rock etc.). This implicit data is used in generating recommendations. For example: you see that a user has given high ratings to Brad Pitt movies, so you recommend her Babel.

Collaborative Filtering algorithms In CF, it is a little different: other users have an impact on the recommendations. Users generate recommendations implicitly. First, users similar to the active user (the user the recommendations are prepared for) are found. Then, by weighting these users, a recommendation list is prepared from their data.

CF Example It is found that many users who like Ayumi Hamasaki songs also like Ai Otsuka songs. In this case, if the active user does not know about Ai Otsuka but is known to like Ayumi Hamasaki, then Ai Otsuka is recommended to her.

CF Example (continued) In the movie domain: There is a user-movie-rating table. It is very sparse. That is, no rating exists for most user-movie pairs.
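One way to picture such a table is a sketch that stores only the ratings that exist. The user names, movie titles and the `sparsity` helper below are made-up illustrations, not data from the talk.

```python
# Hypothetical toy data: ratings[user][movie] = rating on a 1-5 scale.
# Only observed ratings are stored, which suits a sparse table.
ratings = {
    "ann": {"Babel": 5, "Seven": 4},
    "bob": {"Babel": 2},
    "cem": {"Seven": 5, "Troy": 3},
}

def sparsity(ratings):
    """Fraction of user-movie cells that have no rating."""
    movies = {m for user_ratings in ratings.values() for m in user_ratings}
    total_cells = len(ratings) * len(movies)
    observed = sum(len(user_ratings) for user_ratings in ratings.values())
    return 1 - observed / total_cells
```

In real systems the missing fraction is typically far larger than in this toy table, which is why a dense user-by-movie matrix is usually avoided.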

CF Algorithms Two types of algorithms exist for CF: model based algorithms and memory based algorithms. In model based algorithms, you create a model of the domain. Most of the work is done offline. In memory based algorithms, you use the whole database in creating recommendations. Most of the work is done online.

Model based CF Model based algorithms are efficient (fast when recommending) and quite accurate (predictions are quite good). But they rely on long offline computations. Thus they are harder to maintain and update. On the Internet, new users need to be added all the time, so this is a drawback for model based algorithms. An example is Bayesian networks.

Memory based CF Many memory based CF algorithms exist, the best known being the neighborhood based algorithm described by Herlocker et al. [4]. In neighborhood based algorithms, the users most similar to the active user are selected as that user's neighborhood. After the neighborhood is found, predictions are made using a weighted sum of the ratings given by those neighboring users.

Neighborhood based algorithms For finding the neighbors, several correlation methods could be used. One such method is Pearson's correlation coefficient.
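As a rough illustration, Pearson's correlation between two users can be computed over the items both have rated. This is a minimal sketch: the function name, the dict-based input format and the choice to return 0.0 when the correlation is undefined are assumptions made here, not details from the talk.

```python
from math import sqrt

def pearson(ratings_a, ratings_u):
    """Pearson correlation between two users over their co-rated items.

    ratings_a, ratings_u: dicts mapping item -> rating.
    Returns 0.0 when fewer than two items overlap or when either user has
    zero variance on the overlap (an assumed convention; the correlation
    is undefined in those cases).
    """
    common = set(ratings_a) & set(ratings_u)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_u[i] - mean_u) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    if var_a == 0 or var_u == 0:
        return 0.0
    return cov / sqrt(var_a * var_u)
```

Two users whose ratings rise and fall together get a weight near 1, opposite tastes give a weight near -1, and unrelated tastes give a weight near 0.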

Neighborhood based algorithms In Pearson's formula, σ is the standard deviation, the subscript a denotes the active user, and u denotes the user considered as a neighbor. After the similarity weights are found, one needs to select the most similar users and generate a prediction. The neighborhood used in prediction can be selected in many ways, for example with the top-n method or the thresholding method.
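The two selection methods can be sketched as follows, assuming the similarity weights are already available as a dict from user to weight. The function names and the sample weights are hypothetical.

```python
def top_n_neighbors(weights, n):
    """Top-n method: keep the n users with the highest similarity weight."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def threshold_neighbors(weights, min_weight):
    """Thresholding method: keep every user whose weight is at least min_weight."""
    return {user: w for user, w in weights.items() if w >= min_weight}

# Hypothetical similarity weights between the active user and other users.
weights = {"bob": 0.9, "cem": 0.4, "dee": -0.2, "eli": 0.7}
```

Top-n always yields a neighborhood of fixed size, while thresholding can return very few or very many users depending on the cutoff.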

Neighborhood based algorithms After selecting the neighbors to be considered, you weight these users and generate a prediction. Z-scores are used to normalize the ratings.
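A minimal sketch of such a prediction follows, assuming the common form in which each neighbor's rating is converted to a z-score, averaged using the similarity weights, and rescaled to the active user's own mean and standard deviation. The function name, the input format and the handling of edge cases are assumptions for illustration.

```python
from statistics import mean, pstdev

def predict(active_ratings, neighbors, item):
    """Predict the active user's rating for item from weighted neighbor z-scores.

    active_ratings: non-empty dict item -> rating for the active user.
    neighbors: list of (weight, ratings_dict) pairs for the selected neighbors.
    Returns None if no usable neighbor has rated the item.
    """
    mean_a = mean(active_ratings.values())
    std_a = pstdev(active_ratings.values())
    numerator = 0.0
    denominator = 0.0
    for weight, ratings_u in neighbors:
        if item not in ratings_u:
            continue
        std_u = pstdev(ratings_u.values())
        if std_u == 0:
            continue  # z-score is undefined for a user with constant ratings
        z = (ratings_u[item] - mean(ratings_u.values())) / std_u
        numerator += weight * z
        denominator += abs(weight)
    if denominator == 0:
        return None
    # Rescale the averaged z-score to the active user's rating scale.
    return mean_a + std_a * numerator / denominator
```

Normalizing with z-scores compensates for users who rate systematically high or low, or who use only a narrow part of the scale.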

Cluster based algorithms The naive neighborhood based algorithm is computationally too expensive. It is O(mn), where m and n are the numbers of items and users respectively. In the clustering approach, if you have a constant number of clusters, the complexity is O(m). It is also easier to compute predictions for new users. Details are given next.

Cluster based algorithms Users are members of clusters. Clusters can be formed using many different algorithms, described in detail in the paper by Jain et al., Data Clustering: a review [7]. The goal is to group together similar users and use these clusters in choosing the neighborhood of the active user. The approach is very efficient, scalable and easy to update. If the number of clusters equals n, the number of users, it degrades into the neighborhood based algorithm. There are accuracy considerations.

Cluster based algorithms If you choose the number of clusters to be small, your predictions get worse. You have a trade-off between speed and accuracy. The best approach is to determine the cluster size and number empirically.
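The idea can be sketched under the assumption that the clusters have already been built offline and that each is summarized by a centroid of average ratings. The distance measure, the function names and the centroid values are hypothetical choices for illustration, not the method of any particular paper.

```python
def distance(ratings, centroid):
    """Mean squared difference over items the user rated that the centroid covers."""
    common = set(ratings) & set(centroid)
    if not common:
        return float("inf")
    return sum((ratings[i] - centroid[i]) ** 2 for i in common) / len(common)

def nearest_cluster(ratings, centroids):
    """Index of the closest centroid; the work grows with clusters, not users."""
    return min(range(len(centroids)), key=lambda k: distance(ratings, centroids[k]))

def predict_from_cluster(ratings, centroids, item):
    """Use the nearest cluster's average rating for the item as the prediction."""
    return centroids[nearest_cluster(ratings, centroids)].get(item)

# Hypothetical centroids: average rating per movie within each cluster.
centroids = [
    {"Babel": 4.5, "Seven": 4.0, "Troy": 2.0},
    {"Babel": 1.5, "Seven": 2.0, "Troy": 4.5},
]
```

With few clusters each centroid averages over many dissimilar users, which illustrates why predictions get coarser as the number of clusters shrinks.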

CF Metrics The two main metrics for CF algorithms are accuracy and complexity. For accuracy, the mean absolute error (MAE) is frequently used: the absolute errors between predicted and actual ratings are averaged to find this value. For complexity, one can use big-O notation. Other qualities are also important for predictions: coverage, novelty and serendipity, confidence, and user feedback.
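As a small illustration, MAE can be computed from paired lists of predicted and actual ratings. The function name and argument layout are assumptions.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

For example, predictions of 4, 3 and 5 against actual ratings of 5, 3 and 3 have absolute errors 1, 0 and 2, so the MAE is 1.0.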

CF Metrics Coverage refers to the percentage of movies for which the system is able to make a prediction. Serendipity and novelty refer to how novel the recommendations made by the recommender are. Confidence is a measure of how confident the system is when making a recommendation. User feedback is important in fine-tuning the system, so it should also be used.
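Coverage can be sketched in the same spirit, assuming the recommender reports `None` for a movie it cannot predict. That convention and the function name are assumptions made for this example.

```python
def coverage(predictions):
    """Percentage of movies for which a prediction could be made.

    predictions: dict movie -> predicted rating, or None when the system
    could not produce a prediction for that movie.
    """
    if not predictions:
        return 0.0
    made = sum(1 for p in predictions.values() if p is not None)
    return 100.0 * made / len(predictions)
```

A memory based system tends to lose coverage on movies that none of the active user's neighbors have rated.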

Conclusion CF is already in use on the Internet, although its history only dates back several years. It still has development potential. It offers great improvements to user enjoyment. Thanks for your attention. Any questions?

References
[1] May 2001 issue of Scientific American: http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21
[2] For more information about Web 2.0, see the Wikipedia article at: http://en.wikipedia.org/wiki/Web_2.0
[3] Jon Herlocker, Joseph Konstan, John Riedl. An empirical analysis of design choices in neighborhood-based algorithms. Information Retrieval, 2002.
[4] Jon Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. SIGIR'99.
[5] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time CF algorithm.
[6] Al Mamunur Rashid, Shyong K. Lam, George Karypis, and John Riedl. ClustKNN: A highly scalable hybrid model & memory based algorithm. WEBKDD'06, 2006.
[7] A. K. Jain, M. N. Murty, P. J. Flynn. Data Clustering: a review. ACM Computing Surveys, 1999.
