Collective Intelligence (CI): Defined
- The intelligence extracted from the collective set of interactions and contributions made by your users.
- The use of this intelligence as a filter for what is valuable in your application for a user.
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
Collective Intelligence: Explicit Resources
Ways to Harness CI
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
CI Requirements
You need to:
- Allow users to interact with your site and with each other, learning about each user through their interactions and contributions.
- Aggregate what you learn about your users and their contributions using some useful models.
- Leverage those models to recommend relevant content to a user (yielding higher retention and completion rates).
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
Forms of CI Data
Data comes in two forms: structured data and unstructured data.
- Structured data has a well-defined form, which makes it easy to store and query (e.g. user ratings, content articles viewed, and items purchased).
- Unstructured data is typically in the form of raw text (e.g. reviews, discussion forum posts, blog entries, and chat sessions).
- Most applications transform unstructured data into structured data.
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
CI Data Model
Most applications generally consist of users and items. An item is any entity of interest in your application. If your application is a social-networking application, or you’re looking to connect one user with another, then a user is also a type of item.
(Diagram labels: Users, Metadata, Items)
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
Classification of Recommender Engines
Non-Personalized Collaboration Non-personalized recommendations are identical for each user. The recommendations are either manually selected (e.g. editor choices) or based on the popularity of items (e.g. average ratings, sales data).
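The popularity-based case can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: the (user, item, rating) triples, the function name top_rated, and the min_votes threshold are all assumptions made for the example.

```python
from collections import defaultdict


def top_rated(ratings, n=2, min_votes=2):
    """Rank items by mean rating; every user receives the same list."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum of ratings, vote count]
    for _user, item, score in ratings:
        totals[item][0] += score
        totals[item][1] += 1
    # Ignore items with too few votes (an assumed safeguard, not from the slides).
    means = {item: s / c for item, (s, c) in totals.items() if c >= min_votes}
    # Highest mean first; ties broken alphabetically for a stable result.
    return sorted(means, key=lambda item: (-means[item], item))[:n]


# Hypothetical (user, item, rating) triples, for illustration only.
sample_ratings = [
    ("ann", "A", 5), ("bob", "A", 4),
    ("ann", "B", 3), ("bob", "B", 3), ("cat", "B", 4),
    ("cat", "C", 5),
]
popular = top_rated(sample_ratings)
print(popular)
```

Item C has the highest single rating but only one vote, so the assumed threshold keeps it out of the list.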
Non-Personalized: Example
Non-Personalized: Example
Demographic Recommendation The users are categorized based on the attributes of their demographic profiles in order to find users with similar features. The engine then suggests or recommends (explicitly or implicitly) items that are preferred by these similar users.
Demographic Recommendation: Example
Demographic Recommendation: Example
Demographic Recommendation: Mystery Movie
Demographic Recommendation: Guilt by Association
Advantages
- New users can get recommendations before they have rated any item.
- The technique is domain independent.
Limitations
- Gathering the required demographic data leads to privacy issues.
- Demographic classification is too crude for highly personalized recommendations.
- Users with unusual tastes may not get good recommendations (“gray sheep” problem).
- Once established, user preferences do not change easily (stability vs. plasticity problem).
Collaborative Filtering
- Employs user-item ratings (or votes) as its information source.
- The concept is to make correlations between users or between items.
- The correlations are used to predict user behavior and make recommendations.
- Widely implemented and the most mature recommendation technique.
Collaborative Filtering: Example
Collaborative Filtering: Example
Collaborative Filtering: Example
Collaborative Filtering: Example
Collaborative Filtering: Example
Collaborative Filtering: Main Approaches
- User-based
- Item-based
- Model-based
Collaborative Filtering: User-Based
- Assumes that users who rated the same items similarly probably have the same taste.
- It makes user-to-user correlations by using the rating profiles of different users to find highly correlated users.
- These users form like-minded neighborhoods based on their shared item preferences.
- The engine can then recommend the items preferred by the other users in the neighborhood.
Ratings (rows are users, columns are items; ? is the rating to predict):
        j1  j2  j3  j4  j5   i
   u     1   2   3   4   5   ?
   v1    1   2   3   4   5   5
   v2    5   4   3   2   1   1
Collaborative Filtering: Similarity?
A mathematical concept analogous to the notion of Euclidean distance.
(Figure: points A, B, and C plotted on axes running from 0 to 3, with distances of 1 and 2.24 labeled.)
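A small sketch of distance as a basis for similarity. The coordinates below are assumptions chosen so that the two distances come out to 1 and 2.24, the values labeled on the slide; the actual positions of A, B, and C are not recoverable from the transcript. Turning a distance d into a similarity via 1 / (1 + d) is one common convention, not necessarily the one the deck uses.

```python
import math

# Hypothetical coordinates; chosen only to reproduce the distances 1 and 2.24.
A, B, C = (1, 1), (2, 1), (2, 3)


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def similarity(p, q):
    """Map distance to (0, 1]: identical points score 1, distant points approach 0."""
    return 1.0 / (1.0 + euclidean(p, q))


d_ab = euclidean(A, B)
d_ac = euclidean(A, C)
print(round(d_ab, 2), round(d_ac, 2))
```

Smaller distance means higher similarity, so under these assumed coordinates A is more similar to B than to C.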
Collaborative Filtering: Similarity?
- Cosine
- Correlation
- Adjusted Cosine
Collaborative Filtering: Cosine Similarity (an Example)
Step 1: Find the square root of the sum of squares for each row of scores.
Step 2: Divide each score in a row by that row's square root of the sum of squares.
Step 3: Calculate the cosine similarity between users by summing the cross-products of their normalized scores (from Step 2).
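The three steps can be sketched directly. The rating rows reuse the values for users u, v1, and v2 on items j1 to j5 from the user-based slide; the function names are illustrative only.

```python
import math

# Ratings on items j1..j5, taken from the user-based slide's table.
u = [1, 2, 3, 4, 5]
v1 = [1, 2, 3, 4, 5]
v2 = [5, 4, 3, 2, 1]


def normalize(row):
    norm = math.sqrt(sum(x * x for x in row))  # Step 1: sqrt of the sum of squares
    return [x / norm for x in row]             # Step 2: divide each score by it


def cosine(a, b):
    # Step 3: sum the cross-products of the normalized scores.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


print(round(cosine(u, v1), 3), round(cosine(u, v2), 3))
```

Note that u and v2 have opposite tastes yet still score about 0.64, because raw cosine ignores each user's mean rating; this is one motivation for the correlation and adjusted-cosine measures listed on the previous slide.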
Collaborative Filtering: User-Based Predictions and Recommendations
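The content of this slide is not preserved in the transcript, so the following is only one plausible sketch of a user-based prediction, not necessarily the deck's method. It assumes Pearson correlation as the similarity and a mean-centered weighted average of the neighbors' ratings, with each neighbor's mean taken over the items shared with the target user. The data is the u, v1, v2 table from the user-based slide.

```python
import math

ratings = {
    "u":  {"j1": 1, "j2": 2, "j3": 3, "j4": 4, "j5": 5},
    "v1": {"j1": 1, "j2": 2, "j3": 3, "j4": 4, "j5": 5, "i": 5},
    "v2": {"j1": 5, "j2": 4, "j3": 3, "j4": 2, "j5": 1, "i": 1},
}


def shared_items(a, b):
    return [k for k in a if k in b]


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    shared = shared_items(a, b)
    if len(shared) < 2:
        return 0.0
    ma = sum(a[k] for k in shared) / len(shared)
    mb = sum(b[k] for k in shared) / len(shared)
    num = sum((a[k] - ma) * (b[k] - mb) for k in shared)
    den = math.sqrt(sum((a[k] - ma) ** 2 for k in shared)
                    * sum((b[k] - mb) ** 2 for k in shared))
    return num / den if den else 0.0


def predict(user, item):
    """Target's mean plus the similarity-weighted deviations of the neighbors."""
    target = ratings[user]
    target_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for other, theirs in ratings.items():
        shared = shared_items(target, theirs)
        if other == user or item not in theirs or not shared:
            continue
        sim = pearson(target, theirs)
        their_mean = sum(theirs[k] for k in shared) / len(shared)
        num += sim * (theirs[item] - their_mean)
        den += abs(sim)
    return target_mean + num / den if den else target_mean


print(predict("u", "i"))
```

Because v2 is perfectly anti-correlated with u, its low rating of item i counts as evidence that u will rate it highly.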
Collaborative Filtering: User-Based Disadvantages
- Cold start: What do you do with users who have few or no ratings?
- Sparsity: What do you do if there is little overlap in ratings across users in the data set?
- Scale: What if there are millions of users? Does this scale well as the number of comparisons increases?
- Real-time: How do you do these calculations in real time?
Collaborative Filtering: Item-Based Example www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf Amazon.com has more than 29 million customers and several million catalog items. Other major retailers have comparably large data sources. While all this data offers opportunity, it’s also a curse, breaking the backs of algorithms designed for data sets three orders of magnitude smaller. Almost all existing algorithms were evaluated over small data sets.
Collaborative Filtering: Item-Based
- Assumes that items rated similarly are probably similar.
- Compares items based on the shared appreciation of users, in order to create neighborhoods of similar items.
- The engine then recommends the items neighboring the ones the user is known to prefer.
Item-based: i is more similar to j5 than to the other items, so predict ? = 5.
        j1  j2  j3  j4  j5   i
   u     1   2   3   4   5   ?
   v1    1   2   3   4   5   5
   v2    5   4   3   2   1   1
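The item-based reasoning on this slide can be sketched as follows. This is an illustrative version under stated assumptions: item columns are compared with cosine similarity over the users who rated the target item, and the prediction uses only the k most similar items (k=1 reproduces the slide's "predict 5"). The function names are hypothetical.

```python
import math

ratings = {
    "u":  {"j1": 1, "j2": 2, "j3": 3, "j4": 4, "j5": 5},
    "v1": {"j1": 1, "j2": 2, "j3": 3, "j4": 4, "j5": 5, "i": 5},
    "v2": {"j1": 5, "j2": 4, "j3": 3, "j4": 2, "j5": 1, "i": 1},
}


def column(item, users):
    """Ratings given to one item by the listed users."""
    return [ratings[name][item] for name in users]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def predict_item_based(user, target, k=1):
    # Users other than the target user who have rated the target item.
    raters = [name for name, r in ratings.items() if name != user and target in r]
    if not raters:
        return None
    sims = []
    for item in ratings[user]:
        if all(item in ratings[name] for name in raters):
            sims.append((cosine(column(target, raters), column(item, raters)), item))
    top = sorted(sims, reverse=True)[:k]  # the k most similar items
    if not top:
        return None
    # Similarity-weighted average of the user's own ratings on those items.
    return sum(s * ratings[user][item] for s, item in top) / sum(s for s, _ in top)


print(predict_item_based("u", "i"))
```

Item i's column (5, 1) matches j5's column exactly, so with k=1 the prediction is simply u's own rating of j5.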
Collaborative Filtering: Item-Based Example
Collaborative Filtering: Item-Based Advantages
- Scalable: More scalable than the user-based approach because correlations are drawn among a limited number of products instead of a potentially very large number of users.
- Sparsity: Because the number of items is typically smaller than the number of users, the item-based approach has a reduced sparsity problem in comparison to the user-based approach.
Collaborative Filtering: There's Money in CF – The Netflix Prize
Collaborative Filtering: Netflix Prize
Collaborative Filtering: GroupLens Rating Data Sets for Testing
MovieLens, WikiLens (Beers), Book-Crossing, Jester Joke, HP EachMovie
Collaborative Filtering: Other Applications Anything that can be represented in matrix form where n is a number representing a nominal (e.g. 0,1 for present, absent), ordinal, interval or ratio value
CI from Content: Text Mining Defined
Text mining (also known as text data mining or knowledge discovery in textual databases) is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
- Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
- Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
- Summarization. Summarizing a document to save time on the part of the reader.
- Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
- Clustering. Grouping similar documents without having a predefined set of categories.
- Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
- Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
CI from Content: Resources
CI from Content: Resources
- Ronen Feldman, “Information Extraction: Theory and Practice,” Bar-Ilan University, Israel. u.cs.biu.ac.il/~feldman/icml_tutorial.html
- Seth Grimes, “Text Analytics for BI/DW Practitioners.” altaplana.com/TDWI2008Aug-TextAnalyticsForBIDWPractitioners.pdf
- Bing Liu, “Opinion Mining & Summarization – Sentiment Analysis.” April 2008. cs.uic.edu/~liub/FBS/opinion-mining-sentiment-analysis.pdf
CI from Content: Some Interesting Data Sets for Research and Training
- Natural Language Toolkit (NLTK), nltk.org: A diverse set of “corpora,” used in conjunction with Natural Language Processing with Python. Data set description: nltk.googlecode.com/svn/trunk/nltk_data/index.xml
- Enron Email: 500K Enron emails sent primarily by senior managers over a 3.5-year period covering the height of the scandal. There are multiple versions of the set, including database versions. cs.cmu.edu/~enron/
- 9/11 Pager Messages: Approximately 500K messages sent in and around the WTC area before, during, and after the attacks. 911.wikileaks.org/release/messages.zip
- Web Site APIs: Del.icio.us, Technorati, Twitter
- Web Sites Devoted to Data Sets: http://www.datawrangling.com/some-datasets-available-on-the-web and http://blog.jonudell.net/2007/07/05/show-me-the-data/
CI from Content: Text Mining Process
CI from Content: Preparing Data for Term Document Matrix
- Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
- Normalize: Convert the terms to lowercase.
- Eliminate stop words: Eliminate terms that appear very often (e.g. "the").
- Stemming: Convert the terms into their stemmed form (e.g. remove plurals).
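The four preparation steps can be sketched with the standard library alone. The stop-word list here is a tiny illustrative stand-in, and naive_stem only strips a trailing "s"; it is an assumption for the example and not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def naive_stem(term):
    """Crude plural stripping only (hypothetical stand-in for a real stemmer)."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def prepare(text):
    tokens = re.findall(r"[A-Za-z]+", text)            # tokenization
    terms = [t.lower() for t in tokens]                # normalization
    terms = [t for t in terms if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in terms]              # stemming


terms = prepare("The users rated the books and the blogs")
print(terms)
```

The resulting term lists are what would be counted to fill a term-document matrix.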
CI for Unstructured Contents:  Analyzing Blogs
CI for Unstructured Contents:  Analyzing Blogs (RSS Feed)
CI for Unstructured Contents:  Analyzing Blogs (Source of RSS Feed)
CI for Unstructured Contents: Structure of an “Atom” Feed
CI for Unstructured Contents: Preparing Blog RSS Feed for Analysis
(Flow diagram, reconstructed as steps)
1. Start with a collection of blogs, each with an RSS feed of entries (Entry1, Entry2, …).
2. Access and parse the feed, and retrieve the contents of each entry. Result: a collection of entry contents (HTML) for each blog.
3. Normalize, remove stop words, and stem. Result: a list of word stems for each entry.
4. Compute the word/stem counts for each word in each collection.
5. Compute word counts for each blog by summing word counts across entries. Result: a matrix of word counts for each blog.
6. Select words for analysis based on word counts. Result: a subset of words for analysis and a matrix of word counts by blog.
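The counting and selection stages of this pipeline can be sketched as below, assuming the feed has already been fetched, parsed, and stemmed. The blog names, word stems, and the min_frac/max_frac thresholds are all hypothetical; the slide says only that words are selected "based on word counts", so filtering by the fraction of blogs containing a word is one possible reading.

```python
from collections import Counter

# Hypothetical blogs, each a list of entries already reduced to word stems.
blogs = {
    "blog_a": [["python", "data", "mine"], ["data", "cluster"]],
    "blog_b": [["movie", "rate", "data"], ["movie", "review"]],
    "blog_c": [["python", "cluster", "data"]],
}


def blog_counts(entries):
    """Sum word counts across all entries of one blog."""
    counts = Counter()
    for entry in entries:
        counts.update(entry)
    return counts


def select_words(per_blog, min_frac=0.5, max_frac=0.9):
    """Keep words that appear in a middling fraction of blogs (assumed thresholds)."""
    n = len(per_blog)
    doc_freq = Counter(word for counts in per_blog.values() for word in counts)
    return sorted(w for w, df in doc_freq.items() if min_frac <= df / n <= max_frac)


per_blog = {name: blog_counts(entries) for name, entries in blogs.items()}
words = select_words(per_blog)
matrix = {name: [counts[w] for w in words] for name, counts in per_blog.items()}
print(words)
print(matrix)
```

Words found in every blog ("data") or in only one blog are dropped because they do little to distinguish blogs from each other; the remaining rows can be fed to a clustering routine.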
CI for Unstructured Contents: Word Counts for Collection of Blogs
CI from Content: Data Mining Applied to Prepared Text Data
http://www.kdnuggets.com/index.html?lg
CI for Unstructured Contents: Blog Dendrogram
CI for Unstructured Contents:  Blog Results for K-Means Clustering
CI from Content: Simple Example “We Feel Fine”
- Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day).
- Scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts the sentence, and looks to see if it includes one of about 5,000 pre-identified "feelings".
- If a valid feeling is found, the sentence is said to represent one person who feels that way.
- The URL format of many blog posts can be used to extract the username of the post's author, which is used to extract the age, gender, country, state, and city of the blog's owner.
- Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written.
- We extract and save as much of this information as we can, along with the post.
CI from Content: Simple Example “We Feel Fine” Visualizations
Madness, Murmurs, Montage, Mounds, Metrics, Mobs
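The sentence-scanning step described above can be approximated with regular expressions. This is a rough sketch and not the actual We Feel Fine implementation: the FEELINGS set is a tiny hypothetical stand-in for the list of about 5,000 feelings, and the sentence splitter is deliberately simple.

```python
import re

# Hypothetical stand-in for the ~5,000 pre-identified feelings.
FEELINGS = {"happy", "sad", "fine", "tired"}
TRIGGER = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)


def extract_feelings(post):
    """Return (feeling, sentence) pairs for sentences containing a trigger phrase."""
    results = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", post.strip()):
        if not TRIGGER.search(sentence):
            continue
        words = re.findall(r"[a-z]+", sentence.lower())
        feeling = next((w for w in words if w in FEELINGS), None)
        if feeling:
            results.append((feeling, sentence))
    return results


post = ("Long day at work. I feel tired but the weather is nice. "
        "I am feeling happy about tomorrow!")
found = extract_feelings(post)
print(found)
```

Sentences without a trigger phrase, or with a trigger but no recognized feeling, are skipped.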
CI from Content: 9/11 Pager Data 2001-09-11 08:52:46 Skytel [002386438] B  ALPHA  Netdesk@nbc.com||Reports of a plane crash near World Trade Center - no more details at this point.  WNBC's LIVE pix - Network working on coverage.
CI from Content: 9/11 Pager Data
Dataveillance: Roger Clarke
- The systematic use of personal data systems in the investigation or monitoring of the actions or communications of one or more persons.
- The terms personal surveillance and mass surveillance are commonly used, but seldom defined.
- Personal surveillance is the surveillance of an identified person. In general, a specific reason exists for the investigation or monitoring.
- Mass surveillance is the surveillance of groups of people, usually large groups. In general, the reason for investigation or monitoring is to identify individuals who belong to some particular class of interest to the surveillance organization.
Dataveillance: Resources
Dataveillance
Data Mining & Social Network Analysis
Data sources: ChoicePoint (17B records), Acxiom, Equifax (400M credit holders), Experian, …, Internet & Other Communication Data Sources
Issues with Privacy and Dataveillance
Worldwide Privacy Protection
- There is a tangled matrix of laws and regulations worldwide governing the privacy and protection of this data.
- Any time we interact on the Web, we're likely to cross a number of jurisdictions.
- There is a widely held belief that the data produced from our activities is protected.
- There is a widely held belief among internet users that it’s hard to identify or link specific traces & trails with specific
- Law enforcement and intelligence agencies (worldwide) are persistent in their requests for internet and communications data.
Re-Identifiability of Information
Deals with the linkage of datasets without explicit identifiers such as name and address.
Examples of Re-identification
- A large portion of the US population can be re-identified using a combination of 5-digit ZIP code, gender, and date of birth.
- AOL case 4417749 (2006 release of 20 million search queries of over 650,000 users).
- CMU study of predicting SSNs: it is possible to guess many, if not all, of the nine digits in an individual's Social Security number using publicly available information (about location and birth date).
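One way to see why ZIP code, gender, and date of birth matter is to measure how many records are unique on that combination. The sketch below uses a handful of made-up records; the field names and the unique_fraction helper are assumptions for illustration, and the output says nothing about real population figures.

```python
from collections import Counter

# Made-up records with no names attached, for illustration only.
records = [
    {"zip": "02139", "gender": "F", "dob": "1970-01-15"},
    {"zip": "02139", "gender": "F", "dob": "1970-01-15"},
    {"zip": "02139", "gender": "M", "dob": "1982-07-04"},
    {"zip": "90210", "gender": "F", "dob": "1991-11-30"},
]


def unique_fraction(rows, keys=("zip", "gender", "dob")):
    """Fraction of rows whose combination of quasi-identifiers is unique."""
    combos = [tuple(row[k] for k in keys) for row in rows]
    group_sizes = Counter(combos)
    unique = sum(1 for combo in combos if group_sizes[combo] == 1)
    return unique / len(rows)


print(unique_fraction(records))
print(unique_fraction(records, keys=("gender",)))
```

A record that is unique on these fields can be linked to any other dataset that carries the same fields alongside a name, which is the linkage the slide describes.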

Digital Trails Dave King 1 5 10 Part 2 D3

  • 1. Collective Intelligence (CI): Defined The intelligence that’s extracted out from the collective set of interactions and contributions made by your users. The use of this intelligence to act as a filter for what’s valuable in your application for a user Source: Alag, S. Collective Intelligence in Action . Manning Press (2009)
  • 3. Ways to Harness CI Source: Alag, S. Collective Intelligence in Action . Manning Press (2009)
  • 4. CI Requirements You need to: Allow users to interact with your site and with each other, learning about each user through their interactions and contributions. Aggregate what you learn about your users and their contributions using some useful models. Leverage those models to recommend relevant content to a user (yielding higher retention & completion rates) Source: Alag, S. Collective Intelligence in Action . Manning Press (2009)
  • 5. Forms of CI Data Data comes in two forms: structured data and unstructured data. Structured data has a well defined form, something that makes it easily stored and queried on.( e.g. user ratings, content articles viewed, and items purchased …) Unstructured data is typically in the form of raw text (e.g. reviews, discussion forum posts, blog entries, and chat sessions …) Most applications transform unstructured data into structured data Source: Alag, S. Collective Intelligence in Action . Manning Press (2009)
  • 6. CI Data Model Most applications generally consist of users and items. An item is any entity of interest in your application. If your application is a social-networking application, or you’re looking to connect one user with another, then a user is also a type of item. Source: Alag, S. Collective Intelligence in Action . Manning Press (2009) Users Metadata Items
  • 8. Non-Personalized Collaboration Non-personalized recommendations are identical for each user. The recommendations are either manually selected (e.g. editor choices) or based on the popularity of items (e.g. average ratings, sales data).
  • 11. Demographic Recommendation The users are categorized based on the attributes of their demographic profiles in order to find users with similar features. The engine then suggests or recommends (explicitly or implicitly) items that are preferred by these similar users.
  • 15. Demographic Recommendation: Guilt by Association Advantages New users can get recommendations before they have rated any item. Technique is domain independent Limitations Gathering the required demographic data leads to privacy issues. Demographic classification is too crude for highly personalized recommendations. Users with an unusual taste may not get good recommendations (“gray sheep” problem). Once established user preferences do not change easily (stability vs. plasticity problem).
  • 16. Collaborative Filtering Employs user-item ratings (or votes) as their information source. The concept is to make correlations between users or between items. The correlations are used to predict user behavior and make recommendations. Widely implemented and the most mature recommendation technique.
  • 22. Collaborative Filtering: Main Approaches User-based Item-based Model-based
  • 23. Collaborative Filtering: User-Based Assumption that users that rated the same items similarly probably have the same taste. It make user-to-user correlations by using the rating profiles of different users to find highly correlated users. These users form like-minded neighborhoods based on their shared item preferences. The engine then can recommend the items preferred by the other users in the neighborhood. j 1 j 2 j 3 j 4 j 5 i u 1 2 3 4 5 ? v 1 1 2 3 4 5 5 v 2 5 4 3 2 1 1
  • 24. Collaborative Filtering: Similarity? Mathematical concept analogous to the notion of Euclidean Distance 1 2 3 0 1 2 3 A B C 1 2.24
  • 25. Collaborative Filtering: Similarity? Cosine Correlation Adjusted Cosine
  • 26. Collaborative Filtering: Cosine Similarity (an Example) Step 1: Find SQRT of Sum of Squares Each Row of Scores Step 2: Divide each Scores In row by SQRT of Sum of SQs Step3: Calculate Cosine Similarity Between Users by Summing X-Products of their normalized Scores (from Step 2)
  • 27. Collaborative Filtering: User-Based Predictions and Recommendations
  • 28. Collaborative Filtering: User-Based Disadvantages Cold Start: What do you do with users who have no or few ratings? Sparcity: What do you do if there is little overlap in user ratings across users in the data set? Scale: What if there are millions of users? Does this scale well as the number of comparisons increases? Real-time: How do you do these calculations in real-time.
  • 29. Collaborative Filtering: Item-Based Example www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf Amazon.com has more than 29 million customers and several million catalog items. Other major retailers have comparably large data sources. While all this data offers opportunity, it’s also a curse, breaking the backs of algorithms designed for data sets three orders of magnitude smaller. Almost all existing algorithms were evaluated over small data sets.
  • 30. Collaborative Filtering: Item-Based Assumes that items rated similarly are probably similar. Compares items based on the shared appreciation of users, in order to create neighborhoods of similar items. The engine then recommends the neighboring items of the user’s known preferred ones. Item-based: i similar to j 5 more than other items Predict ? = 5 j 1 j 2 j 3 j 4 j 5 i u 1 2 3 4 5 ? v 1 1 2 3 4 5 5 v 2 5 4 3 2 1 1
  • 32. Collaborative Filtering: Item-Based Advantages Scalable: More scalable than the user-based approach because correlations are drawn among a limited number of products, instead of a potentially very large number of users. Sparcity: Because the number of items is naturally smaller than the number of users, the item-based approach has a reduced sparsity problem in comparison to the user-based approach.
  • 33. Collaborative Filtering: There's Money in CF – The Netflix Prize
  • 35. Collaborative Filtering: Group Lens Rating Data Sets for Testing MovieLens , Wikilens (Beers), Book-Crossing, Jester Joke, HP EachMovie
  • 36. Collaborative Filtering: Other Applications Anything that can be represented in matrix form where n is a number representing a nominal (e.g. 0,1 for present, absent), ordinal, interval or ratio value
  • 37. CI from Content: Text Mining Defined Text mining (also known as text data mining or knowledge discovery in textual databases) is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources. Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching. Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user. Summarization. Summarizing a document to save time on the part of the reader. Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes. Clustering. Grouping similar documents without having a predefined set of categories. Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods. Question answering . Finding the best answer to a given question through knowledge-driven pattern matching.
  • 38. CI from Content: Resources
  • 39. CI from Content: Resources Ronen Feldman, “Information Extraction: Theory and Practice,” Bar-Ilan University, ISRAEL, u.cs.biu.ac.il/~feldman/icml_tutorial.html Seth Grimes, “Text Analytics for BI/DW Practitioners.” altaplana.com/TDWI2008Aug-TextAnalyticsForBIDWPractitioners.pdf Bing Liu, “Opinion Mining & Summarization – Sentiment Analysis.” April 2008. cs.uic.edu/~liub/FBS/opinion-mining-sentiment-analysis.pdf
  • 40. CI from Content: Some Interesting Data Sets for Research and Training Natural Language Toolkit (NLTK): nltk.org: Diverse set of “corpora,” used in conjunction with Natural Language Processing with Python . Data set description: nltk.googlecode.com/svn/trunk/nltk_data/index.xml Enron Email 500K Enron emails sent primarily by sr. managers over a 3.5 year period covering the height of the scandal. There are multiple versions of the set including database versions. cs.cmu.edu/~enron/ 9/11 Pager Messages Approximately 500K messages sent in and around WTC area before, during, and after the attacks 911.wikileaks.org/release/messages.zip Web Site APIs Del.icio.us Technorati Twitter Web Sites Devoted to Data Sets http://www.datawrangling.com/some-datasets-available-on-the-web http://blog.jonudell.net/2007/07/05/show-me-the-data/
  • 41. CI from Content: Text Mining Process
  • 42. CI from Content: Preparing Data for Term Document Matrix Tokenization —Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text. Normalize — Convert them to lowercase. Eliminate stop words — Eliminate terms that appear very often (e.g. the). Stemming — Convert the terms into their stemmed form—remove plurals.
  • 43. CI for Unstructured Contents: Analyzing Blogs
  • 44. CI for Unstructured Contents: Analyzing Blogs (RSS Feed)
  • 45. CI for Unstructured Contents: Analyzing Blogs (Source of RSS Feed)
  • 46. CI for Unstructured Contents: Structure of an “Atom” Feed
  • 47. CI for Unstructured Contents: Preparing Blog RSS Feed for Analysis Collection of Blogs → RSS Feed (Entry1, Entry2, …) → Access & parse feed; retrieve contents of each entry → Collection of entry contents (HTML) for each blog → Normalize, remove stop words, stem → List of word stems for each entry → Compute the word/stem counts for each word in each collection → Compute word counts for each blog by summing word counts across entries → Matrix of word counts for each blog → Select words for analysis based on word counts → Subset of words for analysis → Matrix of word counts by blog
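The counting and selection stages of this pipeline can be sketched as follows, assuming each blog has already been reduced to a list of entries and each entry to a list of word stems. The function names and the selection rule are assumptions: the slide only says words are selected “based on word counts”, so the sketch keeps words that appear in a middling share of blogs, with hypothetical thresholds `min_frac` and `max_frac`.

```python
from collections import Counter


def blog_word_counts(entries):
    """Sum per-entry stem counts into one Counter for the whole blog."""
    total = Counter()
    for stems in entries:
        total.update(Counter(stems))
    return total


def select_words(counts_by_blog, min_frac=0.1, max_frac=0.5):
    """Keep words found in a middling share of blogs (thresholds are assumptions)."""
    n_blogs = len(counts_by_blog)
    blogs_with_word = Counter()
    for counts in counts_by_blog.values():
        blogs_with_word.update(counts.keys())
    return sorted(
        word for word, n in blogs_with_word.items()
        if min_frac <= n / n_blogs <= max_frac
    )


def word_count_matrix(counts_by_blog, words):
    """One row per blog, one column per selected word (missing words count as 0)."""
    return {
        blog: [counts[word] for word in words]
        for blog, counts in counts_by_blog.items()
    }


if __name__ == "__main__":
    # Made-up toy data: blog name -> list of entries -> list of stems.
    blogs = {
        "a": [["python", "data"], ["python"]],
        "b": [["data", "cloud"]],
        "c": [["cloud", "data"]],
    }
    counts = {name: blog_word_counts(entries) for name, entries in blogs.items()}
    words = select_words(counts, min_frac=0.3, max_frac=0.7)
    print(words, word_count_matrix(counts, words))
```

Filtering by the share of blogs that use a word drops both near-universal words, which do not distinguish blogs, and very rare ones, which add noise to later clustering.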
  • 48. CI for Unstructured Contents: Word Counts for Collection of Blogs
  • 49. CI from Content: Data Mining applied to Prepared Text Data http://www.kdnuggets.com/index.html?lg
  • 50. CI for Unstructured Contents: Blog Dendrogram
  • 51. CI for Unstructured Contents: Blog Results for K-Means Clustering
  • 52. CI from Content: Simple Example “We Feel Fine” Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day). Scans blog posts for sentences with the phrases “I feel” and “I am feeling”, extracts each sentence, and looks to see whether it includes one of about 5,000 pre-identified “feelings”. If a valid feeling is found, the sentence is said to represent one person who feels that way. The URL format of many blog posts can be used to extract the username of the post's author, which is used to extract the age, gender, country, state, and city of the blog's owner. Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.
  • 53. CI from Content: Simple Example “We Feel Fine” Visualizations: Madness, Murmurs, Montage, Mounds, Metrics, Mobs
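The sentence-scanning step described on this slide can be approximated with a short Python sketch. It is an illustration under stated assumptions, not the We Feel Fine code: `FEELINGS` is a five-word made-up sample of the roughly 5,000 pre-identified feelings, the sentence splitter is naive, and the first feeling found in a sentence is the one reported.

```python
import re

# Hypothetical sample of the pre-identified feelings list.
FEELINGS = {"happy", "sad", "lonely", "fine", "tired"}

# The two trigger phrases named on the slide.
TRIGGER = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_feelings(post):
    """Return (sentence, feeling) pairs for sentences containing a trigger phrase."""
    results = []
    for sentence in split_sentences(post):
        if not TRIGGER.search(sentence):
            continue
        words = re.findall(r"[a-z]+", sentence.lower())
        feeling = next((w for w in words if w in FEELINGS), None)
        if feeling is not None:
            results.append((sentence, feeling))
    return results


if __name__ == "__main__":
    post = "Went home. I feel so tired today! I am feeling great. The sky is blue."
    print(extract_feelings(post))
```

In the sample post, “I am feeling great.” contains a trigger phrase but is dropped because “great” is not in the toy feelings list, which mirrors the slide's rule that only sentences with a valid feeling are saved.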
  • 54. CI from Content: 9/11 Pager Data 2001-09-11 08:52:46 Skytel [002386438] B ALPHA Netdesk@nbc.com||Reports of a plane crash near World Trade Center - no more details at this point. WNBC's LIVE pix - Network working on coverage.
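A record like the sample above has to be split into fields before any text mining can be applied to the message body. The following Python sketch shows one way to do that. The field names (`network`, `pager_id`, `flag`, `mode`, `body`) are guesses inferred from this single sample line, not a documented schema for the data set.

```python
import re
from datetime import datetime

# Layout inferred from one sample line; the field names are assumptions.
LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<network>\S+) \[(?P<pager_id>\d+)\] "
    r"(?P<flag>\S+) (?P<mode>\S+) (?P<body>.*)$"
)


def parse_pager_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return record


if __name__ == "__main__":
    sample = (
        "2001-09-11 08:52:46 Skytel [002386438] B ALPHA "
        "Netdesk@nbc.com||Reports of a plane crash near World Trade Center"
    )
    print(parse_pager_line(sample))
```

Returning `None` for lines that do not match keeps the parser usable on a large file where some records may follow a different layout.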
  • 55. CI from Content: 9/11 Pager Data
  • 56. Dataveillance: Roger Clarke The systematic use of personal data systems in the investigation or monitoring of the actions or communications of one or more persons. The terms personal surveillance and mass surveillance are commonly used, but seldom defined. Personal surveillance is the surveillance of an identified person. In general, a specific reason exists for the investigation or monitoring. Mass surveillance is the surveillance of groups of people, usually large groups. In general, the reason for investigation or monitoring is to identify individuals who belong to some particular class of interest to the surveillance organization.
  • 58. Dataveillance Data Mining & Social Network Analysis ChoicePoint (17B records) Acxiom Equifax (400M credit holders) Experian … Internet & Other Communication Data Sources
  • 59. Issues with Privacy and Dataveillance Worldwide Privacy Protection: There is a tangled matrix of laws and regulations worldwide governing the privacy and protection of this data. Any time we interact on the Web, we're likely to cross a number of jurisdictions. There is a widely held belief that the data produced by our activities is protected. There is a widely held belief among internet users that it's hard to identify or link specific traces & trails with specific individuals. Law enforcement and intelligence agencies (worldwide) are persistent in their requests for internet and communications data.
  • 60. Re-Identifiability of Information Deals with the linkage of datasets without explicit identifiers such as name and address. Examples of re-identification: A large portion of the US population can be re-identified using a combination of 5-digit ZIP code, gender, and date of birth. AOL case 4417749 (2006 release of 20 million search queries of over 650,000 users). CMU study of predicting SSNs: it is possible to guess many, if not all, of the nine digits in an individual's Social Security number using publicly available information (about location and birth date).
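The ZIP code, gender, and date-of-birth example can be made concrete by measuring how many records in a data set are unique on that combination. The Python sketch below is a minimal illustration: the function name, the field names, and the four records are all made up, and the result says nothing about real population figures.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers=("zip", "gender", "dob")):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


if __name__ == "__main__":
    # Made-up records with no names or addresses.
    records = [
        {"zip": "15213", "gender": "F", "dob": "1970-01-02"},
        {"zip": "15213", "gender": "F", "dob": "1970-01-02"},
        {"zip": "15213", "gender": "M", "dob": "1970-01-02"},
        {"zip": "60607", "gender": "F", "dob": "1985-06-30"},
    ]
    print(unique_fraction(records))
```

A record that is unique on these three fields can be linked to any other data set that carries the same fields alongside a name, which is the linkage the slide describes.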

Editor's Notes

  • #20: Customers Who Bought On the information page for every item, Amazon shows the “Customers Who Bought” feature that recommends items frequently purchased by customers who purchased the selected item. The feature is also used on the shopping cart page. This works as the equivalent to the impulse items in a supermarket checkout line, but here the impulse items are personalized for each customer.
  • #35: Contest begins October 2, 2006 and continues through at least October 2, 2011. Contest is open to anyone, anywhere (except certain countries listed below). You have to register to enter. Once you register and agree to these Rules, you’ll have access to the Contest training data and qualifying test sets. To qualify for the $1,000,000 Grand Prize, the accuracy of your submitted predictions on the qualifying set must be at least 10% better than the accuracy Cinematch can achieve on the same training data set at the start of the Contest. To qualify for a year’s $50,000 Progress Prize, the accuracy of any of your submitted predictions that year must be less than or equal to the accuracy value established by the judges the preceding year. To win and take home either prize, your qualifying submissions must have the largest accuracy improvement verified by the Contest judges, you must share your method with (and non-exclusively license it to) Netflix, and you must describe to the world how you did it and why it works.