SlideShare a Scribd company logo
Interactive Analytics in Human Time
S u p r e e t h R a o , S u n i l G u p t a ⎪ J u n e 1 8 , 2 0 1 4
4 5 t h B a y A r e a H U G , S u n n yv a l e , C a l i f o r n i a
Interactive – How we see it?
2 Yahoo Confidential & Proprietary
60B events,
3.5TB of compressed data
Response 400ms
Serve an ad and get insights
< 2s
Agenda:
Motivation
Approach
Problem Deepdive- Instant Overlap
Summary
Questions
Motivation:
Lots of data
Analytics
Data restatement - batch and real time
Human time
Lots of data
~30B advertising events/day
~10s of TB of compressed data/day
Minutes to Year Grain
Multi-quarter data retention
Data Aging
Analytics
Reporting Metrics
Attribution
Multi-level hierarchical computation
Bidding/Targeting optimization
Non-additive computation
Data Restatement
Real time
Batch
Producer Consumer
quick path, lower
amount of checks or
reconciliation,
typically no lookups
high latency path,
checks and
reconciliations,
can have lookbacks
and lookups
Human Time
<1s ( 99 percentile)
Default time grain ( < 300 ms)
Instant overlap ( < 60s)
Data ingested, insights available ( < 2s)
Lots of data
Analytics
Data restatement - batch and real time
Human time
Approach:
Data Ingestion or Collection
Transformations
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Data Pipelines
Data Warehouse/ Analytics and
Optimizations
Reporting Application/UI
Logical View - Scope
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Impacts
Out of scope
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Batch processing DAG, Real-time topology, SOX,
Traffic protection, Late processing, Retention,
Completeness Monitoring, PII cleansing/masking
Compatible with HDFS, Performance (Indexed,
Columnar, Compression, Serialization, Flexibility,
Concurrency, Grain of data stored)
Distributed/Stand-alone, Caching objects vs caching
results
Access to data with group by, order by etc..; SQL or SQL
like
Translate JSON to SQL(optional)
Logical View - Characteristics
Impacts
Out of scope
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Hadoop MR/PIG /Oozie(Lotus)/Storm(Trident)
Druid, Shark, Hive, Oracle RAC, Mysql, Hbase, Impala
memcached_y, Redis
JSON-REST API ; JDBC; ODBC
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Logical View - Choices
Impacts
Out of scope
How we do what we do? Components of Advertising Data Warehouse
Druid
JDBC/ODBC
Data Warehouse-Persistence
Hive
Metrics
Store
JSON-API Persistence and run-time compute
Computation and Ingestion
Quick cache ( using a database for now)
Upstream: API layer, MSTR,
Adhoc access, Identity Service,
Ad-Serving manifests
Data Producers; Serving,
Scoring, Booking, 3rd/1st Party
Data
Real time and batch compute engine
(Hadoop/Storm )
Data filtering/transformations:
Transformations, format
conversions
Custom Algorithms : computing
recursive uniques, indexing
Human time, How?
Druid for interactive queries
Storm-Druid for quick ingestion and
index
Specialized computation and
processing for quicker response
› Sketches
› Feature sequence based overlaps
› Custom indexing
Problem Deepdive: Instant Overlap
Users
Car
commuters
Soccer
Fans
Vegans
Users
Car
Commuters
Soccer
Fans
Vegans
Overlap
Non-additive
› Require access to raw (user level data) to compute
non-additive
• Billions of events a day
• TBs of data a day
 1-1 vs 1-n vs few-n
› Between car commuter and vegan what is the overlap
› For Car commuter which are the top overlap groups
› For Vegan, Car commuters what are the top overlap
groups
Re-stating motivation
Given two sets having identifiers, how
can we do exact overlaps in close to
real time?
( < 1 min).
Overlap is like a AND operation or a set
Existing Approaches
● Use exact compute paradigms
o Do joins for intersections which will lead to
exact results
 Hive, PIG, MR can all support efficient joins
 Exact but not real time
● Use sketches
o Approximate algorithms
 HLL, KMV, accuracy vs size, performance
 Approx, needs high perf tuning
 close to real time but not exact
Using Feature Sequences – 1/4
Feature sequence encoding
o Encode the sequence
 {Ram} - { car commuter, soccer fan,...}
 {Tom} - { soccer fan, vegan...}
 {Sam} - { car commuter, soccer fan, vegan...}
 ….
Using Feature Sequences – 2/4
Eliminate the user on encoded bitmaps
 {car commuter, soccer fan, vegan...}- count -c1- #
 {soccer fan, vegan...} - count - c2 - #
 {car commuter, vegan...} - count - c2 - #
Counts become additive now
Using Feature Sequences – 3/4
● Store row qualifications into a bitmap
o Car commuter- Row1, Row3
 1010000000
o Vegan - Row1, Row2, Row3
 1110000000
● Load the bitmap into Druid using a
custom indexer
o in-memory or memory mapped
Using Feature Sequences – 4/4
 Data Structures
› {feature_sequence}->count
› Feature->row qualification bitmaps
 AND is now an “AND” on bitmaps
› supported within Druid
› Very efficient
 Works alongside topN and
groupBys
Comparison with existing algorithm
● 1-n – Bulk Overlap on grid
o 19 hours on grid
o Few-n calls for a re-process
o 1-1 ( <1s)
● Instant Overlap
o < 60s ( pre-processing 3-4 hours)
o Supports “exact” AND
o Flexible ( few-n, 1-n)
o 1-1 ( < 1s)
Summary
● Yahoo’s Advertising Data Warehouse
o Peta Byte Scale
o Normalized view across many systems
o Analytics and optimizations with specialized
algorithms
o Data restatement - batch and realtime
o Human time
Thank You
@supreeth_
@_skgupta
We are hiring!
Reach out to us at
bigdata@yahoo-inc.com.
Data Ingestion or Collection
Transformations
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Data Pipelines
Data Warehouse/ Analytics and
Optimizations
Reporting Application/UI
Logical View - Scope
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Collection
Impacts
Out of scope
Dimension Flexibility
Many dimensions
Adding new dimensions
Time zones
Time grain
Normalized view across systems
PaidSearch
Display
Native
Programmatic buying and
selling
Ad-targeting
Hardware Configs
●High-memory boxes
●SSD preferred
●Savings due to better
compression

More Related Content

PDF
Hadoop Summit 2010 Keynote
PPTX
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
PPTX
Introduction to Hadoop at Data-360 Conference
PPTX
Extractiv
PDF
Pinterest - Big Data Machine Learning Platform at Pinterest
PPTX
Recommendation engine
PPTX
Big Data tools in practice
Hadoop Summit 2010 Keynote
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Introduction to Hadoop at Data-360 Conference
Extractiv
Pinterest - Big Data Machine Learning Platform at Pinterest
Recommendation engine
Big Data tools in practice

What's hot (20)

PDF
Discovery & Consumption of Analytics Data @Twitter
PPTX
Bigdata
PPTX
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
PDF
Translating Models to Medicine an Example of Managing Visual Communications
PDF
Future of Data - Big Data
PDF
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
PPTX
Atlanta MLConf
PDF
Operationalizing Machine Learning at Scale at Starbucks
PPTX
Data science big data and analytics
PDF
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
PDF
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
PPTX
MLFlow as part of ML CI/CD at Avalara
PDF
LinkedinResume
PDF
AI Modernization at AT&T and the Application to Fraud with Databricks
PDF
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
PPTX
Optimize Data for the Logical Data Warehouse
PDF
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
PPTX
Introduction to hadoop
PDF
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
PPTX
AI from your data lake: Using Solr for analytics
Discovery & Consumption of Analytics Data @Twitter
Bigdata
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Translating Models to Medicine an Example of Managing Visual Communications
Future of Data - Big Data
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
Atlanta MLConf
Operationalizing Machine Learning at Scale at Starbucks
Data science big data and analytics
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
MLFlow as part of ML CI/CD at Avalara
LinkedinResume
AI Modernization at AT&T and the Application to Fraud with Databricks
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
Optimize Data for the Logical Data Warehouse
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Introduction to hadoop
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
AI from your data lake: Using Solr for analytics
Ad

Viewers also liked (12)

PPT
PLEs from virtual ethnography of web 2.0
PDF
MODUL/MEDIA PEMBELAJARAN SINARWATI LABINDA
DOCX
DOCX
Lesson Plan 20151016-Donald TANG
PPTX
We are don bosco
PDF
44. Instalowanie urządzeń peryferyjnych
PDF
iOS 7 Transition guide
DOCX
A casa de zaqueu
PPT
Leslie Bill
PPTX
Illustrations from mysteries of harris burdick
PDF
yoctoComm - Cellular Services over Wi-Fi
DOC
Un week celebration project proposals
PLEs from virtual ethnography of web 2.0
MODUL/MEDIA PEMBELAJARAN SINARWATI LABINDA
Lesson Plan 20151016-Donald TANG
We are don bosco
44. Instalowanie urządzeń peryferyjnych
iOS 7 Transition guide
A casa de zaqueu
Leslie Bill
Illustrations from mysteries of harris burdick
yoctoComm - Cellular Services over Wi-Fi
Un week celebration project proposals
Ad

Similar to June 2014 HUG: Interactive analytics over hadoop (20)

PPTX
Interactive Analytics in Human Time
PPT
Counting Unique Users in Real-Time: Here's a Challenge for You!
PDF
Microsoft Big Data @ SQLUG 2013
PDF
Big data berlin
PPTX
Agile data warehousing
PDF
Meta scale kognitio hadoop webinar
PDF
Piano Media - approach to data gathering and processing
PDF
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
PDF
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
PDF
Predictive Analytics - BarCamp Boston 2011
PPTX
Check Point Big Data Forum m3
PPTX
ParStream - Big Data for Business Users
PPTX
Drill njhug -19 feb2013
PPTX
Tools and Methods for Big Data Analytics by Dahl Winters
PPTX
Tools and Methods for Big Data Analytics by Dahl Winters
PPTX
DataJan27.pptxDataFoundationsPresentation
ODP
Big data nyu
PDF
Big Data and Implications on Platform Architecture
PPTX
Three Tools for "Human-in-the-loop" Data Science
PDF
Dirty data? Clean it up! - Datapalooza Denver 2016
Interactive Analytics in Human Time
Counting Unique Users in Real-Time: Here's a Challenge for You!
Microsoft Big Data @ SQLUG 2013
Big data berlin
Agile data warehousing
Meta scale kognitio hadoop webinar
Piano Media - approach to data gathering and processing
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Predictive Analytics - BarCamp Boston 2011
Check Point Big Data Forum m3
ParStream - Big Data for Business Users
Drill njhug -19 feb2013
Tools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl Winters
DataJan27.pptxDataFoundationsPresentation
Big data nyu
Big Data and Implications on Platform Architecture
Three Tools for "Human-in-the-loop" Data Science
Dirty data? Clean it up! - Datapalooza Denver 2016

More from Yahoo Developer Network (20)

PDF
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
PDF
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
PDF
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
PDF
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
PDF
CICD at Oath using Screwdriver
PDF
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
PPTX
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
PDF
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
PPTX
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
PPTX
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
PDF
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
PPTX
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
PDF
Moving the Oath Grid to Docker, Eric Badger, Oath
PDF
Architecting Petabyte Scale AI Applications
PDF
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
PPTX
Jun 2017 HUG: YARN Scheduling – A Step Beyond
PDF
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
PPTX
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
PPTX
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
PPTX
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
CICD at Oath using Screwdriver
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Moving the Oath Grid to Docker, Eric Badger, Oath
Architecting Petabyte Scale AI Applications
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics

Recently uploaded (20)

PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Spectroscopy.pptx food analysis technology
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
cuic standard and advanced reporting.pdf
PPTX
Big Data Technologies - Introduction.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Assigned Numbers - 2025 - Bluetooth® Document
PPTX
A Presentation on Artificial Intelligence
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
Machine learning based COVID-19 study performance prediction
PPTX
Cloud computing and distributed systems.
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Encapsulation_ Review paper, used for researhc scholars
Spectroscopy.pptx food analysis technology
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Per capita expenditure prediction using model stacking based on satellite ima...
sap open course for s4hana steps from ECC to s4
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
cuic standard and advanced reporting.pdf
Big Data Technologies - Introduction.pptx
The AUB Centre for AI in Media Proposal.docx
MIND Revenue Release Quarter 2 2025 Press Release
Assigned Numbers - 2025 - Bluetooth® Document
A Presentation on Artificial Intelligence
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Diabetes mellitus diagnosis method based random forest with bat algorithm
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Electronic commerce courselecture one. Pdf
Machine learning based COVID-19 study performance prediction
Cloud computing and distributed systems.

June 2014 HUG: Interactive analytics over hadoop

  • 1. Interactive Analytics in Human Time S u p r e e t h R a o , S u n i l G u p t a ⎪ J u n e 1 8 , 2 0 1 4 4 5 t h B a y A r e a H U G , S u n n yv a l e , C a l i f o r n i a
  • 2. Interactive – How we see it? 2 Yahoo Confidential & Proprietary 60B events, 3.5TB of compressed data Response 400ms Serve an ad and get insights < 2s
  • 6. Lots of data Analytics Data restatement - batch and real time Human time
  • 7. Lots of data ~30B advertising events/day ~10s of TB of compressed data/day Minutes to Year Grain Multi-quarter data retention Data Aging
  • 8. Analytics Reporting Metrics Attribution Multi-level hierarchical computation Bidding/Targeting optimization Non-additive computation
  • 9. Data Restatement Real time Batch Producer Consumer quick path, lower amount of checks or reconciliation, typically no lookups high latency path, checks and reconciliations, can have lookbacks and lookups
  • 10. Human Time <1s ( 99 percentile) Default time grain ( < 300 ms) Instant overlap ( < 60s) Data ingested, insights available ( < 2s)
  • 11. Lots of data Analytics Data restatement - batch and real time Human time
  • 13. Data Ingestion or Collection Transformations Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Data Pipelines Data Warehouse/ Analytics and Optimizations Reporting Application/UI Logical View - Scope Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Impacts Out of scope
  • 14. Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Batch processing DAG, Real-time topology, SOX, Traffic protection, Late processing, Retention, Completeness Monitoring, PII cleansing/masking Compatible with HDFS, Performance (Indexed, Columnar, Compression, Serialization, Flexibility, Concurrency, Grain of data stored) Distributed/Stand-alone, Caching objects vs caching results Access to data with group by, order by etc..; SQL or SQL like Translate JSON to SQL(optional) Logical View - Characteristics Impacts Out of scope
  • 15. Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Hadoop MR/PIG /Oozie(Lotus)/Storm(Trident) Druid, Shark, Hive, Oracle RAC, Mysql, Hbase, Impala memcached_y, Redis JSON-REST API ; JDBC; ODBC Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Logical View - Choices Impacts Out of scope
  • 16. How we do what we do? Components of Advertising Data Warehouse Druid JDBC/ODBC Data Warehouse-Persistence Hive Metrics Store JSON-API Persistence and run-time compute Computation and Ingestion Quick cache ( using a database for now) Upstream: API layer, MSTR, Adhoc access, Identity Service, Ad-Serving manifests Data Producers; Serving, Scoring, Booking, 3rd/1st Party Data Real time and batch compute engine (Hadoop/Storm ) Data filtering/transformations: Transformations, format conversions Custom Algorithms : computing recursive uniques, indexing
  • 17. Human time, How? Druid for interactive queries Storm-Druid for quick ingestion and index Specialized computation and processing for quicker response › Sketches › Feature sequence based overlaps › Custom indexing
  • 21. Overlap Non-additive › Require access to raw (user level data) to compute non-additive • Billions of events a day • TBs of data a day  1-1 vs 1-n vs few-n › Between car commuter and vegan what is the overlap › For Car commuter which are the top overlap groups › For Vegan, Car commuters what are the top overlap groups
  • 22. Re-stating motivation Given two sets having identifiers, how can we do exact overlaps in close to real time? ( < 1 min). Overlap is like a AND operation or a set
  • 23. Existing Approaches ● Use exact compute paradigms o Do joins for intersections which will lead to exact results  Hive, PIG, MR can all support efficient joins  Exact but not real time ● Use sketches o Approximate algorithms  HLL, KMV, accuracy vs size, performance  Approx, needs high perf tuning  close to real time but not exact
  • 24. Using Feature Sequences – 1/4 Feature sequence encoding o Encode the sequence  {Ram} - { car commuter, soccer fan,...}  {Tom} - { soccer fan, vegan...}  {Sam} - { car commuter, soccer fan, vegan...}  ….
  • 25. Using Feature Sequences – 2/4 Eliminate the user on encoded bitmaps  {car commuter, soccer fan, vegan...}- count -c1- #  {soccer fan, vegan...} - count - c2 - #  {car commuter, vegan...} - count - c2 - # Counts become additive now
  • 26. Using Feature Sequences – 3/4 ● Store row qualifications into a bitmap o Car commuter- Row1, Row3  1010000000 o Vegan - Row1, Row2, Row3  1110000000 ● Load the bitmap into Druid using a custom indexer o in-memory or memory mapped
  • 27. Using Feature Sequences – 4/4  Data Structures › {feature_sequence}->count › Feature->row qualification bitmaps  AND is now an “AND” on bitmaps › supported within Druid › Very efficient  Works alongside topN and groupBys
  • 28. Comparison with existing algorithm ● 1-n – Bulk Overlap on grid o 19 hours on grid o Few-n calls for a re-process o 1-1 ( <1s) ● Instant Overlap o < 60s ( pre-processing 3-4 hours) o Supports “exact” AND o Flexible ( few-n, 1-n) o 1-1 ( < 1s)
  • 29. Summary ● Yahoo’s Advertising Data Warehouse o Peta Byte Scale o Normalized view across many systems o Analytics and optimizations with specialized algorithms o Data restatement - batch and realtime o Human time
  • 30. Thank You @supreeth_ @_skgupta We are hiring! Reach out to us at bigdata@yahoo-inc.com.
  • 31. Data Ingestion or Collection Transformations Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Data Pipelines Data Warehouse/ Analytics and Optimizations Reporting Application/UI Logical View - Scope Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Impacts Out of scope
  • 32. Dimension Flexibility Many dimensions Adding new dimensions Time zones Time grain
  • 33. Normalized view across systems PaidSearch Display Native Programmatic buying and selling Ad-targeting
  • 34. Hardware Configs ●High-memory boxes ●SSD preferred ●Savings due to better compression

Editor's Notes

  • #14: Logical view of a typical data driven application architecture Sox compliance
  • #17: -quick cache store for querying all metrics in a single fetch, to support one-page load UI architecture - Hive for scheduled job and for adhoc long range generic queries which are not supported on the interactive interface
  • #18: Bitmap indexed, columnar, can operate on compressed bitmaps, distributed -