June 2014 HUG: Interactive Analytics over Hadoop

Interactive Analytics in Human Time
Supreeth Rao, Sunil Gupta | June 18, 2014
45th Bay Area HUG, Sunnyvale, California

Interactive: how do we see it?
Yahoo Confidential & Proprietary
60B events, 3.5 TB of compressed data: response in 400 ms
Serve an ad and get insights in < 2 s

Motivation
Approach
Problem Deep Dive: Instant Overlap
Summary
Questions

Lots of data
Analytics
Data restatement - batch and real time
Human time

Lots of data
~30B advertising events/day
~10s of TB of compressed data/day
Minutes to Year Grain
Multi-quarter data retention
Data Aging

Analytics
Reporting Metrics
Attribution
Multi-level hierarchical computation
Bidding/Targeting optimization
Non-additive computation

Data Restatement
Real time: quick path, fewer checks or reconciliations, typically no lookups
Batch: high-latency path, checks and reconciliations, can have lookbacks and lookups
Producer, Consumer

Human Time
< 1 s (99th percentile)
Default time grain (< 300 ms)
Instant overlap (< 60 s)
Data ingested, insights available (< 2 s)

Logical View - Scope
Layers: Data Ingestion (Transformations/Aggs), Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection
Groupings: Data Ingestion or Collection; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI
Legend: Impacts, Out of scope

Logical View - Characteristics
Data Ingestion: Batch processing DAG, real-time topology, SOX, traffic protection, late processing, retention, completeness monitoring, PII cleansing/masking
Persistence, Runtime Compute: Compatible with HDFS, performance (indexed, columnar, compression, serialization, flexibility, concurrency, grain of data stored)
Caching: Distributed/stand-alone, caching objects vs caching results
API: Access to data with group by, order by, etc.; SQL or SQL-like
Optional Middleware: Translate JSON to SQL (optional)
Other layers: Business API, UI, Data Collection
Legend: Impacts, Out of scope
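
The optional middleware step, translating a JSON request into SQL with group by and order by, can be sketched as below. This is a minimal illustration only: the request shape (`table`, `dimensions`, `metrics`, `order_by`, `limit`), the table name `ad_events`, and the choice of `SUM` as the aggregate are all assumptions, not the format used in the talk.

```python
import json
import re

# Accept only plain identifiers so request values cannot inject SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    if not _IDENT.match(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def json_to_sql(request_json):
    """Translate a (hypothetical) JSON report request into a SQL string."""
    q = json.loads(request_json)
    dims = [_ident(d) for d in q.get("dimensions", [])]
    metrics = [f"SUM({_ident(m)}) AS {_ident(m)}" for m in q["metrics"]]
    sql = f"SELECT {', '.join(dims + metrics)} FROM {_ident(q['table'])}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    if "order_by" in q:
        sql += f" ORDER BY {_ident(q['order_by'])} DESC"
    if "limit" in q:
        sql += f" LIMIT {int(q['limit'])}"
    return sql


request = json.dumps({
    "table": "ad_events",          # hypothetical table name
    "dimensions": ["advertiser"],
    "metrics": ["impressions"],
    "order_by": "impressions",
    "limit": 10,
})
sql = json_to_sql(request)
```

Keeping the translation in a thin middleware layer lets a JSON-REST client and a JDBC/ODBC client share the same SQL-capable store.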

Logical View - Choices
Data Ingestion: Hadoop MR/Pig/Oozie (Lotus)/Storm (Trident)
Persistence, Runtime Compute: Druid, Shark, Hive, Oracle RAC, MySQL, HBase, Impala
Caching: memcached_y, Redis
API: JSON-REST API; JDBC; ODBC
Other layers: Optional Middleware, Business API, UI, Data Collection
Legend: Impacts, Out of scope

How we do what we do? Components of Advertising Data Warehouse
Druid
JDBC/ODBC
Data Warehouse-Persistence
Hive
Metrics Store
JSON-API
Persistence and run-time compute
Computation and Ingestion
Quick cache (using a database for now)
Upstream: API layer, MSTR, ad hoc access, Identity Service, ad-serving manifests
Data Producers: Serving, Scoring, Booking, 3rd/1st-party data
Real-time and batch compute engine (Hadoop/Storm)
Data filtering/transformations: transformations, format conversions
Custom algorithms: computing recursive uniques, indexing

Human time, How?
Druid for interactive queries
Storm-Druid for quick ingestion and indexing
Specialized computation and processing for quicker response
› Sketches
› Feature sequence based overlaps
› Custom indexing
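
As an illustration of the kind of interactive query Druid serves, here is a sketch that builds a Druid native topN query body in Python. The data source `ad_events`, the dimension `advertiser`, the metric `impressions`, and the interval are hypothetical names chosen for the example; the talk does not show its actual queries.

```python
import json


def build_topn_query(data_source, dimension, metric, threshold, interval):
    """Build the JSON body of a Druid native topN query."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,
        "metric": metric,          # rank dimension values by this aggregate
        "threshold": threshold,    # how many top values to return
        "granularity": "all",
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
        "intervals": [interval],   # ISO-8601 start/end
    }


# All argument values below are assumptions for illustration.
query = build_topn_query(
    "ad_events", "advertiser", "impressions", 10, "2014-06-01/2014-06-18"
)
payload = json.dumps(query)
```

The serialized `payload` is what a JSON-REST API layer would send to the query endpoint of the cluster.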

Problem Deep Dive: Instant Overlap

Figure: Users, with overlapping segments Car Commuters, Soccer Fans, Vegans

Overlap
Non-additive
› Requires access to raw (user-level) data to compute non-additive metrics
• Billions of events a day
• TBs of data a day
1-1 vs 1-n vs few-n
› 1-1: between Car Commuters and Vegans, what is the overlap?
› 1-n: for Car Commuters, which are the top overlap groups?
› Few-n: for Vegans and Car Commuters, what are the top overlap groups?
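
A small sketch of why overlap is non-additive and needs user-level data: two populations can have identical per-segment counts yet different overlaps, so the counts alone cannot answer the question. The user IDs are toy data invented for the example.

```python
def overlap(segment_a, segment_b):
    """Exact 1-1 overlap: users present in both segments."""
    return len(segment_a & segment_b)


# World 1: the same two users are both car commuters and vegans.
car_1, vegan_1 = {"u1", "u2"}, {"u1", "u2"}
# World 2: the segments have the same sizes but share no users.
car_2, vegan_2 = {"u1", "u2"}, {"u3", "u4"}

same_counts = (len(car_1), len(vegan_1)) == (len(car_2), len(vegan_2))
overlap_1 = overlap(car_1, vegan_1)
overlap_2 = overlap(car_2, vegan_2)

# Unique users are non-additive as well: summing per-segment counts
# double-counts anyone who belongs to both segments.
summed = len(car_1) + len(vegan_1)
uniques = len(car_1 | vegan_1)
```

Here `same_counts` is true while the overlaps differ, and `summed` overstates `uniques`; both effects are why pre-aggregated per-segment counts are not enough.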

Re-stating motivation
Given two sets of identifiers, how can we compute exact overlaps in close to real time (< 1 min)?
Overlap is like an AND operation or a set intersection.

Existing Approaches
● Use exact compute paradigms
o Do joins for intersections, which lead to exact results
• Hive, Pig, MR can all support efficient joins
• Exact but not real time
● Use sketches
o Approximate algorithms
• HLL, KMV; accuracy vs size, performance
• Approximate; needs heavy performance tuning
• Close to real time but not exact
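
To make the sketch-based approach concrete, here is a minimal KMV (k minimum values) sketch in Python. It is a textbook-style illustration, not the implementation used at Yahoo: the hash choice (SHA-1 truncated to 64 bits), the value of `k`, and the synthetic user IDs are all assumptions.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(items, k):
    """Keep only the k smallest distinct hash values."""
    return set(heapq.nsmallest(k, {hash01(x) for x in items}))


def estimate_distinct(sketch, k):
    if len(sketch) < k:
        return float(len(sketch))   # fewer than k distinct items: exact
    return (k - 1) / max(sketch)    # standard KMV estimator


def estimate_overlap(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| as Jaccard(A, B) * |A ∪ B|."""
    union = set(heapq.nsmallest(k, sketch_a | sketch_b))
    jaccard = len(union & sketch_a & sketch_b) / len(union)
    return jaccard * estimate_distinct(union, k)


K = 2048
segment_a = [f"user{i}" for i in range(60000)]
segment_b = [f"user{i}" for i in range(30000, 90000)]  # true overlap: 30000
estimate = estimate_overlap(kmv_sketch(segment_a, K), kmv_sketch(segment_b, K), K)
```

The sketches are tiny compared with the raw ID sets, which is what makes this path fast, but `estimate` only approximates the true overlap; that is the accuracy-versus-size trade-off the slide refers to.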

Using Feature Sequences – 1/4
Feature sequence encoding
o Encode the sequence
• {Ram} - {car commuter, soccer fan, ...}
• {Tom} - {soccer fan, vegan, ...}
• {Sam} - {car commuter, soccer fan, vegan, ...}
• ….

Eliminate the user on encoded bitmaps
• {car commuter, soccer fan, vegan, ...} - count - c1 - #
• {soccer fan, vegan, ...} - count - c2 - #
• {car commuter, vegan, ...} - count - c3 - #
Counts become additive now
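
The user-elimination step can be sketched as follows: each user is replaced by their feature sequence, identical sequences are collapsed into a count, and overlaps then become sums over those counts. The three users come from the slide; their elided features ("...") are simply left out, and the function names are hypothetical.

```python
from collections import Counter

# User -> features, using only the features named on the slide.
users = {
    "Ram": {"car commuter", "soccer fan"},
    "Tom": {"soccer fan", "vegan"},
    "Sam": {"car commuter", "soccer fan", "vegan"},
}

# Drop the user: keep one row per distinct feature sequence with a count.
sequence_counts = Counter(frozenset(features) for features in users.values())


def overlap(counts, *features):
    """Users having all given features = sum of counts of matching rows."""
    wanted = set(features)
    return sum(c for sequence, c in counts.items() if wanted <= sequence)


car_and_vegan = overlap(sequence_counts, "car commuter", "vegan")
soccer_and_vegan = overlap(sequence_counts, "soccer fan", "vegan")
```

Because every user lands in exactly one row, the row counts can be added without double counting, which is what makes the overlap exact.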

● Store row qualifications in a bitmap
o Car commuter - Row1, Row3
• 1010000000
o Vegan - Row1, Row2, Row3
• 1110000000
● Load the bitmaps into Druid using a custom indexer
o In-memory or memory-mapped

● Data structures
› {feature_sequence} -> count
› Feature -> row qualification bitmaps
● AND is now an “AND” on bitmaps
› Supported within Druid
› Very efficient
● Works alongside topN and groupBys
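
The two data structures and the bitmap AND can be sketched in plain Python, with integers standing in for bitmaps. The rows mirror the slide's example (Car commuter on Row1 and Row3, Vegan on Row1 to Row3); the counts are made-up values, and the Druid custom indexer itself is not shown.

```python
# One row per distinct feature sequence; counts[i] is the user count of row i.
rows = [
    {"car commuter", "soccer fan", "vegan"},   # Row1
    {"soccer fan", "vegan"},                   # Row2
    {"car commuter", "vegan"},                 # Row3
]
counts = [120, 45, 30]  # hypothetical c1, c2, c3


def build_bitmaps(rows):
    """Feature -> bitmap of qualifying rows (bit i set means row i+1)."""
    bitmaps = {}
    for i, sequence in enumerate(rows):
        for feature in sequence:
            bitmaps[feature] = bitmaps.get(feature, 0) | (1 << i)
    return bitmaps


def overlap_count(bitmaps, counts, *features):
    """AND the feature bitmaps, then add the counts of surviving rows."""
    mask = (1 << len(counts)) - 1
    for feature in features:
        mask &= bitmaps.get(feature, 0)
    return sum(c for i, c in enumerate(counts) if (mask >> i) & 1)


def top_overlaps(bitmaps, counts, feature, n):
    """1-n query: the n features overlapping most with the given one."""
    scored = [
        (overlap_count(bitmaps, counts, feature, other), other)
        for other in bitmaps
        if other != feature
    ]
    return sorted(scored, reverse=True)[:n]


bitmaps = build_bitmaps(rows)
car_vegan = overlap_count(bitmaps, counts, "car commuter", "vegan")
top = top_overlaps(bitmaps, counts, "car commuter", 1)
```

Note that the slide writes bitmaps with Row1 as the leftmost digit, while the integers here use the lowest bit for Row1; the AND works the same either way.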

Comparison with existing algorithm
● 1-n – Bulk Overlap on grid
o 19 hours on grid
o Few-n requires a re-process
o 1-1 (< 1 s)
● Instant Overlap
o < 60 s (pre-processing 3-4 hours)
o Supports “exact” AND
o Flexible (few-n, 1-n)
o 1-1 (< 1 s)

Summary
● Yahoo’s Advertising Data Warehouse
o Petabyte scale
o Normalized view across many systems
o Analytics and optimizations with specialized algorithms
o Data restatement - batch and real time
o Human time

Thank You
@supreeth_
@_skgupta
We are hiring!
Reach out to us at bigdata@yahoo-inc.com.

Dimension Flexibility
Many dimensions
Adding new dimensions
Time zones
Time grain

Normalized view across systems
Paid Search
Display
Native
Programmatic buying and selling
Ad-targeting

Hardware Configs
● High-memory boxes
● SSD preferred
● Savings due to better compression
