Profiling a Person With Log Data
  • Jim Jansen, College of Information Sciences and Technology, The Pennsylvania State University, [email_address]
  • Interested in how much descriptive information we can generate about a person by leveraging search log data.
What Did We Find Out?
  • We can tell quite a lot!
The State of Web Search
The Power of Search and the Web
  • Search is the top online activity
  • Search drives over 7 billion monthly queries in the U.S.
  • Online activity has a huge impact on people's daily lives:
      • 70 minutes less with family
      • 30 minutes less TV
      • 8.5 minutes less sleep
  • Sources: comScore, U.S., Feb. '06; Stanford Institute for the Quantitative Study of Society, Nov. '05
Analysis of Search Marketplace
  • Holding fairly stable over the last year or so, albeit with some Bing flux
Search Logs
  • Contain the trace data recorded when a person visits the search engine, submits a query, views results, etc.
  • On one hand, logs have been criticized for not being rich enough (i.e., they record only behaviors, not the 'why' factors)
  • On the other hand, logs have been criticized for recording too much about us (i.e., logging a lot of personal information about a person)
  • How much can we learn about a person from the data stored in search logs?
  • Specifically, how rich a searcher profile can we build of what a person is doing and why they are doing it, and how well can we predict what they are going to do next?
An illustrative example
How much can we tell from a single query?
  • ASIS&T is an acronym for the American Society for Information Science and Technology
  • Good probability that this user is an academic, a researcher, a librarian, or a student in one of these disciplines
  • Leveraging demographic information:
      • 57 percent female / 43 percent male probability
      • 66.2 percent chance the user works in the information science field
      • 55.6 percent probability this user has a master's degree
How much can we tell from a single query?
  • Leveraging demographic information (cont'd):
      • 32.3 percent probability this user has a doctorate
      • 53 percent likelihood the user works in academia
  • Using IP, we can locate the geographical area
  • Based on time, we could infer that this person is:
      • searching for the conference's schedule, for travel (if the query is submitted prior to the meeting), or
      • looking for presentations or papers from the meeting (if the query is submitted after the conference)
  • Theoretically, we can tell a lot!
  • However, with billions of queries per month, we can't do the analysis by hand like this example. To develop user profiles, we need automated methods.
  • Research Question: How complete a profile can one develop for a Web search engine user from search log data? [(a) what the user is doing, (b) what the user is interested in, and (c) what the user intends to do]
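The time-based inference on this slide can be sketched in a few lines. This is only an illustration, not the method used in the study: the function name, the labels, and the conference dates are hypothetical, and the "during the meeting" case is an added assumption that the slide does not cover.

```python
from datetime import date


def infer_timing_intent(query_date: date, conf_start: date, conf_end: date) -> str:
    """Guess why someone searched for a conference, using only the query date."""
    if query_date < conf_start:
        # Before the meeting: likely checking the schedule or planning travel.
        return "planning"
    if query_date > conf_end:
        # After the meeting: likely looking for presentations or papers.
        return "follow-up"
    # Assumption (not on the slide): a query during the meeting is from an attendee.
    return "attending"


# Hypothetical conference dates, chosen only for illustration.
start, end = date(2009, 11, 6), date(2009, 11, 11)
early_guess = infer_timing_intent(date(2009, 10, 1), start, end)
late_guess = infer_timing_intent(date(2009, 12, 1), start, end)
```

A rule like this is trivial for one query, which is the point of the slide: the hard part is running such inferences automatically across billions of queries.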
Specific aspects with automated methods …
  • Location - where the user is - IP look-up script
  • Geographical interest - where the user is going - query term usage
  • Topical interest - what the user is interested in - tools like Open Calais
  • Topical complexity - how motivated the user is - n-grams pattern analysis
  • Content desires - Info, Nav, Transactional - binary tree, k-means clustering
  • Commercial intent - eCommerce related - tools like MSN adLabs
  • Purchase intent - getting ready to buy - session analysis
  • Potential to click on a link - will the user click on a link - time series analysis
  • Gender - demographic targeting/personalization - tools like MSN adLabs
  • User identification - specific user targeting - (need a whole lot of data)
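Session analysis, listed above for purchase intent, usually starts by cutting one user's log entries into sessions. The sketch below shows one common way to do that, a fixed inactivity timeout. The 30-minute cutoff and the function name are assumptions made for illustration; the slides do not say how sessions were defined.

```python
from datetime import datetime, timedelta

# Assumed inactivity cutoff; the slides do not specify one.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group one user's query timestamps into sessions separated by long pauses."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            # Close enough to the previous query: same session.
            sessions[-1].append(ts)
        else:
            # First query, or a pause longer than the cutoff: start a new session.
            sessions.append([ts])
    return sessions


# Hypothetical timestamps for a single user.
queries = [
    datetime(2009, 11, 1, 9, 0),
    datetime(2009, 11, 1, 9, 10),
    datetime(2009, 11, 1, 10, 30),
]
sessions = split_sessions(queries)
```

Once queries are grouped this way, per-session features (number of queries, reformulations, time between queries) can feed whatever intent model is used downstream.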
A comment about user identification
  • Given a group of folks who use a search engine, we can tell a lot about a person within that group from search logs (i.e., behaviors) …
  • … but identifying a particular individual is much more difficult with just search logs (probably takes ~12 to 18 months of data).
User Profiling Framework
  • Classify user aspects into two levels: internal and external.
  • Internal aspects refer to attributes of the users themselves.
  • External aspects relate to the behavior or interest of the users.
  • Interaction between internal and external aspects:
      • Can infer external aspects from internal aspects.
      • External aspects reflect internal aspects.
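One way to picture the two-level framework is as a record with separate slots for internal and external aspects. This is a minimal sketch under stated assumptions: the class name, the field names, and the choice of which example aspects go on which level are illustrative guesses, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """A profile split into the two levels named in the framework."""
    # Internal: attributes of the user (assumed examples: gender, location).
    internal: dict = field(default_factory=dict)
    # External: behavior or interest of the user (assumed examples: topics, intent).
    external: dict = field(default_factory=dict)


# Hypothetical values, loosely following the single-query example earlier.
profile = UserProfile()
profile.external["topical_interest"] = "information science"
profile.internal["works_in_academia_probability"] = 0.53
```

Keeping the two levels apart makes the interaction on the slide explicit: a model can read from one dictionary to estimate entries in the other.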
Thank you! (open for questions and further discussion) Jim Jansen College of Information Sciences and Technology  The Pennsylvania State University  [email_address]
Search logs have some common fields, such as time, queries, results, etc. We can enrich the log with additional fields.
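As a rough illustration of enriching a log, the sketch below models a record with the common fields named above (time, query, results) and adds two derived fields. The record layout, the `enrich` helper, and the specific derived fields (`term_count`, `hour_of_day`) are hypothetical; the slides do not list which fields were added.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogRecord:
    """One search log entry with the common fields named on the slide."""
    time: datetime
    query: str
    results: list
    # Extra fields added during enrichment (hypothetical, for illustration).
    extra: dict = field(default_factory=dict)


def enrich(record: LogRecord) -> LogRecord:
    """Add simple derived fields computed from the existing ones."""
    record.extra["term_count"] = len(record.query.split())
    record.extra["hour_of_day"] = record.time.hour
    return record


# Hypothetical record; the URL is a placeholder.
record = enrich(LogRecord(datetime(2009, 11, 1, 9, 0), "asis&t annual meeting", ["http://example.org"]))
```

Derived fields like these cost little to compute per record, which matters at the volume of billions of queries per month.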