Demand, Media, and Search
Analytics




Sean Timm
sean.timm@teamaol.com
Twitter: @timmsc
October 4, 2011
Introduction
• Who am I?
• What do we use Hadoop for?
• Our best practices
• Lessons learned
• Example applications: related searches and seasonality




History
• Originated in Search Backend in 2007
• Goal: create data-driven products for search.aol.com from search
  logs
• No Netezza experience, so we decided to try Hadoop
• The first simple aggregation took 3 weeks to write
• With Apache Pig 0.3: 2 days
• First product, related searches, launched in 2008
• Search breaking trends product led to further demand work
• Now Pig 0.8.1 and Hadoop 0.20.2




Data
Hourly search.aol.com logs
• 5 M log lines of data per hour
• Logs include searches, clicks, and other data
• 70% of queries are seen only once


Hourly Wikipedia page view data
• Public data set: http://dammit.lt/wikistats
• 7 M pages viewed per hour
• 2.7 M English pages per hour


BeacoN logs
• Page view and click logs for AOL HuffingtonPost Media, Patch, and other AOL
  properties


We like Pig!
•   Hourly, daily, and monthly search and click aggregation
•   Related searches
•   Auto complete dictionary
•   Mining spelling correction click through
•   Temporal pattern analysis
•   Classifying adult queries and URLs
•   Categorizing queries
•   Identifying queries in the form of a question or superlative
•   Identifying breaking trends in AOL Search and Wikipedia page views
•   Identifying queries of local interest
•   Clustering queries using click graph, temporal distance, Carrot2, k-means
•   AOL HPMG stats and trends for page views, authors, tags, etc.



Pig Process in General
Script run time: from under 2 minutes to over 2 hours


Ad hoc…wild west


Complex shell scripts
1. Load, copy, and back up data
2. Launch multiple Pig scripts, some in parallel and some
   with serial dependencies
3. Check for errors; on failure, send e-mail and halt
4. Load data into MySQL, Vertica, or Solr
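
The workflow steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the shell scripts described on the slide: `run_workflow`, `StepFailed`, and the `notify` callback are hypothetical names, and each step is a plain callable standing in for a Pig script or a database load.

```python
from concurrent.futures import ThreadPoolExecutor


class StepFailed(Exception):
    """Raised when a workflow step fails; the workflow halts."""


def run_workflow(stages, notify):
    """Run stages in order; the steps inside one stage run in parallel.

    stages: list of stages, each a list of (name, callable) pairs.
    notify: callable taking an error message (stand-in for sending e-mail).
    Returns the names of the steps that completed.
    """
    completed = []
    for stage in stages:
        with ThreadPoolExecutor(max_workers=max(1, len(stage))) as pool:
            futures = [(name, pool.submit(step)) for name, step in stage]
        # Leaving the "with" block waits for every step in the stage.
        for name, future in futures:
            error = future.exception()
            if error is not None:
                notify(f"step {name} failed: {error}")
                raise StepFailed(name) from error
            completed.append(name)
    return completed
```

Serial dependencies are expressed by putting steps in separate stages; a failure in any stage prevents later stages from starting.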

Getting Data out of Hadoop
First approach: special StoreFunc to write directly to MySQL/Solr
• Network: required the master to be on the same network as the
  cluster
• Speculative execution: data could be written more than once,
  increasing contention and doing unnecessary writes
• Replication: writes reached the master in parallel, but serial
  replication was slow (MySQL)
• Timeouts: occasionally a task failed and restarted (Solr)




Getting Data out of Hadoop
MySQL/Vertica Now
• Write data to HDFS
• Copy from HDFS to local file system using CLI
• Load into database: LOAD DATA LOCAL INFILE from mysql client
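
The MySQL path above could look roughly like the following. This is a sketch only: the slide does not say which HDFS command is used, so `hadoop fs -getmerge` is an assumption, and the function name and all four arguments are hypothetical placeholders. The function only builds the commands; it does not run them.

```python
def export_commands(hdfs_dir, local_file, database, table):
    """Return the two commands for the HDFS -> local file -> MySQL path.

    All four arguments are hypothetical placeholders chosen by the caller.
    """
    # getmerge concatenates the part files of a job's output directory
    # into a single file on the local file system.
    copy_cmd = ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]
    # PigStorage writes tab-separated fields by default, which matches
    # the default field terminator of LOAD DATA.
    sql = f"LOAD DATA LOCAL INFILE '{local_file}' INTO TABLE {table}"
    load_cmd = ["mysql", "--local-infile=1", database, "-e", sql]
    return copy_cmd, load_cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so that a failed copy stops the load from running.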

Solr Now
• Custom StoreFunc writes Solr XML to HDFS
• Starting with Pig 0.7, fields are named using the Pig schema
• Copy from HDFS to local file system using CLI
• Load into Solr using remote streaming
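
To illustrate what "Solr XML named from the schema" means, here is a small Python sketch that serializes rows as a Solr `<add>` update message. It is not the custom StoreFunc from the slide (which is not shown); `to_solr_xml` and the example field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_solr_xml(field_names, rows):
    """Serialize rows as a Solr <add> update message.

    field_names plays the role of the Pig schema: one name per column.
    """
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in zip(field_names, row):
            if value is None:
                continue  # skip null fields rather than emit empty ones
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")
```

Using an XML library rather than string concatenation means characters such as `&` and `<` in queries are escaped correctly.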



UDFs
• Use Piggy Bank and builtins when possible
• 89 custom UDFs packaged in a single jar
• Most are simple
   • Validate a URL, URL decode a string, calculate a hash value,
     date math, etc.
• Some are complex
   • Spell check/correct, LOESS regression, Carrot2 clustering,
     FFT, Euclidean distance, etc.
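
For a sense of scale, the logic behind a few of the simple UDFs named above fits in a handful of lines. The sketch below is plain Python for illustration, not the code in the jar described on the slide; the function names and the exact validation rule (absolute http or https URL with a host) are assumptions.

```python
import math
from urllib.parse import unquote_plus, urlparse


def is_valid_url(text):
    """Return True for absolute http(s) URLs that have a host."""
    try:
        parts = urlparse(text)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def url_decode(text):
    """Decode %XX escapes and '+' (as used in query strings)."""
    return unquote_plus(text)


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```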




Lessons learned
• Prefer one larger categorization script over many small
  ones
• Set priority on large, time-sensitive jobs that fight for resources
  with other jobs
• Fair scheduler
• Tuning the cluster for maps or reduces
• Don't write copious debug output
• Use appropriate number of reducers (PARALLEL)




Related Searches

  Group by Query
Challenges
• Adult terms
• Misspellings
• Breadth of suggestions
• Coverage
• Timeliness of suggestions
Process Flow
• Filter and clean data
    • Block adult terms, long queries, non-alpha, second+ pages,
      operators, URL-like queries, search spam
    • Lowercase
• Join to get query-related query groups
• Contextual spell correct within group
• Cluster related queries and pick the best from each
  group
• Load into Solr
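
The first two stages of the process flow above (filter and clean, then join into related-query groups) can be sketched as follows. This is a minimal sketch with loudly stated assumptions: the slides do not say what the join key is, so co-occurrence within a session is assumed here; the length cutoff, blocked-term list, and regular expressions are hypothetical; second+ page filtering, spell correction, and clustering are left out.

```python
import re
from collections import Counter, defaultdict
from itertools import permutations

MAX_QUERY_LENGTH = 50          # hypothetical cutoff for "long queries"
BLOCKED_TERMS = {"spamword"}   # hypothetical stand-in for adult/spam term lists
URL_LIKE = re.compile(r"^(https?://|www\.)|\.(com|net|org)\b")
OPERATOR = re.compile(r"\b(site|inurl|intitle):")


def clean(query):
    """Return the normalized query, or None if it should be dropped."""
    q = " ".join(query.lower().split())
    if not q or len(q) > MAX_QUERY_LENGTH:
        return None
    if not any(ch.isalpha() for ch in q):
        return None
    if URL_LIKE.search(q) or OPERATOR.search(q):
        return None
    if any(term in BLOCKED_TERMS for term in q.split()):
        return None
    return q


def related_queries(log):
    """log: iterable of (session_id, query). Returns {query: Counter(related)}."""
    sessions = defaultdict(set)
    for session_id, query in log:
        q = clean(query)
        if q is not None:
            sessions[session_id].add(q)
    related = defaultdict(Counter)
    # Assumption: two queries are related when they share a session.
    for queries in sessions.values():
        for a, b in permutations(sorted(queries), 2):
            related[a][b] += 1
    return related
```

The counts give a simple ranking signal: the more sessions two queries share, the stronger the suggestion.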
Related Searches Graph
[Figure: graph of related searches for the query “The Eagles”, with nodes Hotel California, The band, NFL, Tribute, and Boston College]
Classification
• Supervised learning
• Provide categorized set of queries and/or URLs
• Calculate a score based on the edge weights
• If the score exceeds a specified threshold the query or URL is
  tagged with the category
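
One possible reading of the scoring rule above, sketched in Python. The slides do not give the formula, so this is an assumption: the score of a node is the fraction of its edge weight that leads to neighbors already labeled with the category (for example, a query's clicks on URLs from the categorized set). The names `category_score` and `classify` are hypothetical.

```python
def category_score(edges, labeled):
    """Fraction of a node's edge weight that leads to labeled neighbors.

    edges: {neighbor: weight}, e.g. clicked URL -> click count for one query.
    labeled: set of neighbors already known to belong to the category.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    hits = sum(weight for node, weight in edges.items() if node in labeled)
    return hits / total


def classify(graph, labeled, threshold):
    """Return the nodes of graph whose score exceeds threshold."""
    return {
        node
        for node, edges in graph.items()
        if category_score(edges, labeled) > threshold
    }
```

The same functions work in the other direction (URLs scored by the labeled queries that lead to them), since only the edge weights and the labeled set are used.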
Applications Outside of Search
• Author/citation bipartite graph
• Social network graphs
• User/Page view graphs
Temporal traffic correlation of Wikipedia Page Views
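
The slide does not say which correlation measure is used. As an illustrative assumption, the Pearson correlation between two page view series (for example, hourly view counts for two Wikipedia pages) can be computed like this; `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Pages whose traffic rises and falls together score near 1, while unrelated pages score near 0.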




Tomato Seasonality
May: planting tomatoes, tomato cages, types of tomatoes
June: pruning tomato plants
July: tomato diseases, tomato blight, tomato worm
August: tomato recipes, tomato soup, tomato sauce, tomato salsa
September: sun dried tomatoes, canning and freezing tomatoes
October: green tomato recipes




