Demand, Media, and Search Analytics at AOL

Sean Timm
sean.timm@teamaol.com
Twitter: @timmsc
October 4, 2011

Introduction
• Who am I?
• What do we use Hadoop for?
• Our best practices
• Lessons learned
• Example applications: related searches and seasonality

History
• Originated in Search Backend in 2007
• Create data-driven products for search.aol.com from search logs
• No Netezza experience, decided to try Hadoop
• Took 3 weeks to write a simple aggregation
• With Apache Pig 0.3: 2 days
• First product, related searches, launched in 2008
• Search breaking trends product led to further demand work
• Now Pig 0.8.1 and Hadoop 0.20.2

Data
Hourly search.aol.com logs
•5 M log lines of data per hour
•Logs include searches, clicks, and other data
•70% of queries are seen only once
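
A minimal sketch of how a "seen only once" share could be measured, written in Python rather than Pig. It assumes the figure refers to distinct queries and that queries are compared after trimming and lowercasing; neither detail is stated on the slide, and the sample queries are made up.

```python
from collections import Counter


def singleton_fraction(queries):
    """Return the fraction of distinct queries that occur exactly once."""
    # Assumption: queries are normalized by trimming and lowercasing.
    counts = Counter(q.strip().lower() for q in queries)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical sample, not real log data.
sample = ["weather", "Weather", "aol mail", "pig latin tutorial", "hadoop 0.20.2"]
print(singleton_fraction(sample))
```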

Hourly Wikipedia page view data
•Public data set: http://dammit.lt/wikistats
•7 M pages viewed per hour
•2.7 M English pages per hour

BeacoN logs
•Page view and click logs for AOL Huffington Post Media, Patch, and other AOL properties

We like Pig!
• Hourly, daily, and monthly search and click aggregation
• Related searches
• Auto complete dictionary
• Mining spelling correction click through
• Temporal pattern analysis
• Classifying adult queries and URLs
• Categorizing queries
• Identifying queries in the form of a question or superlative
• Identifying breaking trends in AOL Search and Wikipedia page views
• Identifying queries of local interest
• Clustering queries using click graph, temporal distance, Carrot2, k-means
• AOL HPMG stats and trends for page views, authors, tags, etc.

Pig Process in General
Script run times range from under 2 minutes to over 2 hours

Ad hoc…wild west

Complex shell scripts
1. Load, copy, and back up data
2. Launch multiple Pig scripts, some in parallel and some with serial dependencies
3. Check for errors; on failure, e-mail and halt
4. Load data into MySQL, Vertica, or Solr
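
The launch-and-check steps above can be sketched as a small driver. This is an illustrative Python version, not the shell scripts the slide refers to. It assumes each stage is a list of Pig script paths that may run in parallel, that stages run one after another, and that a non-zero exit code means failure. The function names and the `notify` hook (a stand-in for the e-mail step) are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_pig(script):
    """Run one Pig script with the pig CLI and return its exit code."""
    return subprocess.call(["pig", "-f", script])


def run_stage(scripts, runner=run_pig):
    """Run every script in a stage in parallel; return the scripts that failed."""
    if not scripts:
        return []
    with ThreadPoolExecutor(max_workers=len(scripts)) as pool:
        codes = list(pool.map(runner, scripts))
    return [s for s, code in zip(scripts, codes) if code != 0]


def run_pipeline(stages, runner=run_pig, notify=print):
    """Run stages serially; notify and halt at the first stage with a failure."""
    for stage in stages:
        failed = run_stage(stage, runner)
        if failed:
            # Stand-in for the "e-mail and halt" step.
            notify("Pig failure, halting: " + ", ".join(failed))
            return False
    return True
```

A call such as `run_pipeline([["aggregate.pig"], ["related.pig", "trends.pig"]])` would run the first script, then the next two in parallel. The script names are placeholders.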

Getting data out of Hadoop
First approach: special StoreFunc to write directly to MySQL/Solr
•Network: required the master to be on the same network as the cluster
•Speculative execution: data would be written more than once, increasing contention and doing unnecessary writes
•Replication: tasks wrote to the master in parallel, but serial replication was slow (MySQL)
•Timeouts: occasionally a task failed and restarted (Solr)

Getting Data out of Hadoop
MySQL/Vertica Now
•Write data to HDFS
•Copy from HDFS to local file system using CLI
•Load into database: LOAD DATA LOCAL INFILE from mysql client

Solr Now
•Custom StoreFunc writes Solr XML to HDFS
•Starting with Pig 0.7, fields are named using the Pig schema
•Copy from HDFS to local file system using CLI
•Load into Solr using remote streaming
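
To illustrate the Solr XML format such a StoreFunc would emit, here is a small Python sketch. It is not the custom StoreFunc itself. It assumes each tuple becomes one `<doc>` inside an `<add>` element, with field names taken from the schema and null values skipped. The function name and sample data are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def to_solr_xml(field_names, rows):
    """Render rows as a Solr <add> document, one <doc> per row."""
    lines = ["<add>"]
    for row in rows:
        lines.append("  <doc>")
        for name, value in zip(field_names, row):
            if value is None:
                continue  # Assumption: null fields are simply omitted.
            lines.append("    <field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        lines.append("  </doc>")
    lines.append("</add>")
    return "\n".join(lines)


# Hypothetical schema and tuple.
print(to_solr_xml(["query", "count"], [("tom & jerry", 3)]))
```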

UDFs
• Use Piggy Bank and builtins when possible
• 89 custom UDFs packaged in a single jar
• Most are simple
• Validate a URL, URL decode a string, calculate a hash value, date math, etc.
• Some are complex
• Spell check/correct, LOESS regression, Carrot2 clustering, FFT, Euclidean distance, etc.
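
For a sense of scale, the "simple" category might look like the following plain Python functions. These are illustrative stand-ins, not the UDFs in the jar: the names, the accepted URL schemes, and the choice of MD5 are all assumptions, and wiring them into Pig is left out.

```python
import hashlib
from urllib.parse import unquote_plus, urlparse


def is_valid_url(value):
    """Return True if value parses as an http(s) URL with a host."""
    if not value:
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False
    # Assumption: only http and https count as valid.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def url_decode(value):
    """Decode percent-escapes and '+' in a URL-encoded string."""
    return None if value is None else unquote_plus(value)


def hash_value(value):
    """Return a hex digest of the string (MD5 chosen for illustration)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()
```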

Lessons learned
• Instead of many small categorization scripts, it is better to use a single larger one
• Set priority on large time-sensitive jobs that fight for resources with other jobs
• Fair scheduler
• Tuning the cluster for maps or reduces
• Don't write copious debug output
• Use an appropriate number of reducers (PARALLEL)
