Interactive Batch Query At Scale
Adhoc query system for game analytics based on Drill
immars@gmail.com

!1
Related Topics
• Java Programming
• Relational Algebra
• Distributed Database
• Hadoop Ecosystem

!2
About Us
• Elex-tech
• Game Development, Game Publishing
• SNS Games, Web Games, Mobile Games, Apps
• Global Market

!3
• The Problem
• Brief on Drill
• Design Considerations
• Enhancement from Xingcloud
• Now & Future

!4
The Problem

!5
The Problem
• How many logins today?
• How many unique users this week?
• Total income today?
• How many paying users this month?
• …
!6
The Problem: Facts
• How many X during time period of Y

user id    event   amount   timestamp
user_001   login   -        1383729081
user_002   login   -        1383729082
user_001   pay     4.99     1383729084
user_003   login   -        1383729090

Fact Table
!7
The Problem: Facts
• How many logins today?
• How many unique users this week?
• Total income today?
• How many paying users this month?
• …
!8
The Problem: Facts
• How many logins today?

select count(*) from fact where event='login' and date(timestamp)='2013-12-06';

!9
The Problem: Facts
• How many unique users this week?

select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';

!10
The Problem: Facts
• Total income today?

select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';

!11
The Problem: Facts
• How many paying users this month?

select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
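The four fact queries on the preceding slides can be tried against the sample rows of the Fact Table. The following is a minimal sketch using Python's built-in sqlite3 module, not the Drill-based system itself. The column name uid, the helper names build_fact_table and run_fact_queries, storing the "-" amounts as NULL, and the hand-picked time window are all assumptions made for illustration.

```python
import sqlite3

# Sample rows from the Fact Table slide; "-" amounts are stored as NULL (assumption).
FACT_ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]

def build_fact_table():
    """Create an in-memory fact table (hypothetical schema) and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table fact (uid text, event text, amount real, timestamp integer)")
    conn.executemany("insert into fact values (?, ?, ?, ?)", FACT_ROWS)
    return conn

def run_fact_queries(conn, start, end):
    """Run the slide queries over the half-open time window [start, end)."""
    def one(sql, event):
        return conn.execute(sql, (event, start, end)).fetchone()[0]
    window = "from fact where event=? and timestamp>=? and timestamp<?"
    return {
        "logins": one("select count(*) " + window, "login"),
        "login_users": one("select count(distinct uid) " + window, "login"),
        "income": one("select sum(amount) " + window, "pay"),
        "paying_users": one("select count(distinct uid) " + window, "pay"),
    }

# The window is chosen by hand so that it covers all four sample timestamps.
results = run_fact_queries(build_fact_table(), 1383729000, 1383730000)
print(results)
```

Every query has the same shape: one aggregate over the fact rows selected by an event type and a time range, which is the "how many X during time period of Y" pattern.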

!12
The Problem: Dimensions
• How many logins today from China?
• How many unique users on each server this week?
• Total income today from new users?
• How many paying users this month from Adwords?
• …
!13
The Problem: Dimensions
• The user X's property Y is of value Z

user id    reg_time   language   refer      …
user_001   20100612   en         adwords
user_002   20110927   cn         facebook
user_003   20121010   fr         admob
user_004   20130522   it         tapjoy

Dimension Table
!14
Fact & Dimension
• Aggregation on Join

user id    event   amount   timestamp
user_001   login   -        1383729081
user_002   login   -        1383729082
user_001   pay     4.99     1383729084
user_003   login   -        1383729090

user id    reg_time   language   refer      …
user_001   20100612   en         adwords
user_002   20110927   cn         facebook
user_003   20121010   fr         admob
user_004   20130522   it         tapjoy
!15
Fact & Dimension
• How many logins today from China?
• How many unique users on each server this week?
• Total income today from new users?
• How many paying users this month from adwords?
• …
!16
Fact & Dimension
SELECT COUNT DISTINCT (on uid)
JOIN (1 fact, n dimension, on uid)
WHERE (filter by value of dimensions/facts)
GROUP BY (value of dimension)
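The query template above can be instantiated on the sample fact and dimension rows from the earlier slides. This is a minimal sketch using Python's built-in sqlite3 module with a single dimension table; the table names fact and dim and the column name uid are assumptions, and the real system runs such plans on Drill rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table fact (uid text, event text, amount real, timestamp integer);
create table dim (uid text primary key, reg_time text, language text, refer text);
insert into fact values
  ('user_001', 'login', null, 1383729081),
  ('user_002', 'login', null, 1383729082),
  ('user_001', 'pay',   4.99, 1383729084),
  ('user_003', 'login', null, 1383729090);
insert into dim values
  ('user_001', '20100612', 'en', 'adwords'),
  ('user_002', '20110927', 'cn', 'facebook'),
  ('user_003', '20121010', 'fr', 'admob'),
  ('user_004', '20130522', 'it', 'tapjoy');
""")

# COUNT DISTINCT on uid, JOIN fact with dimension on uid,
# WHERE filters on a fact value, GROUP BY a dimension value.
sql = """
select d.language, count(distinct f.uid)
from fact f join dim d on f.uid = d.uid
where f.event = 'login'
group by d.language
order by d.language
"""
login_users_by_language = conn.execute(sql).fetchall()
print(login_users_by_language)
```

user_004 has a dimension row but no fact rows, so it does not appear in any group.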

!17
Fact & Dimension
• SQL
• -> Syntax tree
• -> Logical Plan
• -> Physical Plan

[Figure: query plan tree with agg at the root, two Joins below it, and a filter above each scan (scan: Dimension, scan: Dimension, scan: Fact); annotation: pre-aggregation?]
!19
!20
Combinatorial Explosion!
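The slide gives no numbers. One way to read it, assuming it refers to pre-aggregating a result for every combination of dimension values, is the back-of-the-envelope count below. The function name preaggregated_cells and all cardinalities are made up for illustration and are not figures from the talk.

```python
from math import prod

def preaggregated_cells(cardinalities):
    """Count the cells needed to pre-aggregate every combination of dimension values.

    Each dimension contributes its distinct values plus one "any value" slot,
    so the total is the product of (cardinality + 1) over all dimensions.
    """
    return prod(c + 1 for c in cardinalities)

# Hypothetical cardinalities: language, refer, server, reg_time (in days).
example = {"language": 30, "refer": 50, "server": 200, "reg_time": 1000}
cells_per_metric = preaggregated_cells(example.values())
print(cells_per_metric)
```

The count is per metric and per time bucket, and it multiplies again with every dimension added, which is why an adhoc query path is attractive.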
!21
Access Pattern
             Write            Read by
Facts        Append           date, event
Dimensions   Insert, update   user id, prop value, full table
!22
Volume
• 200GB new Facts
• 50GB Dimension updates

!23
Architecture
[Figure: architecture diagram. Components shown: Query; Drill with a MySQL StorageEngine and an HBase StorageEngine; Storage (MySQL, HBase); Data Loader.]
!24
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future

!25
http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac
!26
http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
!27
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future

!28
http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
!29
Data Model
{
  name: "icecream",
  price: {
    basic: 4.99,
    coupon: true
  }
}
• Various types
• Nested values
  • price.basic
• Schema-free
!30
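The dotted name price.basic on this slide addresses a value nested inside a record. A minimal sketch of that addressing scheme follows; the flatten helper is hypothetical and is not part of Drill's API.

```python
def flatten(record, prefix=""):
    """Flatten a nested dict into {dotted.path: leaf value}."""
    columns = {}
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into nested values, extending the dotted path.
            columns.update(flatten(value, path + "."))
        else:
            columns[path] = value
    return columns

record = {"name": "icecream", "price": {"basic": 4.99, "coupon": True}}
flat = flatten(record)
print(flat)
```

No schema is declared up front: the paths and value types are discovered from the record itself, which is one reading of "schema-free".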
Design Considerations
• As Fast As possible
  • Space efficient
  • Time efficient
!31
about Space Efficiency
• Compact data representation
  • Java object overhead: high
• JVM friendly (GC)
  • Simpler object graph
  • Less tenured space, less full GC
!32
about Time Efficiency
• Cache friendly
  • data access locality
• Superscalar: pipeline friendly
  • the inner loop problem
• SIMD friendly
  • opportunity to operate on a vector of values
• JVM friendly (JNI)
!33
ValueVector & RecordBatch

ValueVector
!34
ValueVector & RecordBatch
• ValueVector
  • small memory overhead
  • backed by DirectByteBuffer
  • further encoding
  • sequential access/random access
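To make the idea of a vector backed by one contiguous buffer concrete, here is a toy analogue in Python. It is not Drill's ValueVector API, which is Java code on top of DirectByteBuffer; the class name VarCharVector and its methods are hypothetical, and the sample values beyond "icecream" are invented.

```python
from array import array

class VarCharVector:
    """Toy variable-width vector: one byte buffer plus an offsets array."""

    def __init__(self):
        self.data = bytearray()
        # Value i occupies data[offsets[i]:offsets[i + 1]].
        self.offsets = array("i", [0])

    def append(self, text):
        self.data += text.encode("utf-8")
        self.offsets.append(len(self.data))

    def get(self, i):
        # Random access costs two offset lookups and one slice.
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")

    def __len__(self):
        return len(self.offsets) - 1

names = VarCharVector()
for value in ["icecream", "cake", "tea"]:
    names.append(value)

# Fixed-width values need no offsets: one contiguous buffer of doubles.
prices = array("d", [4.99, 2.5, 1.0])

print(names.get(1), len(names), list(names.offsets))
```

All values of a column live in two flat buffers instead of one object per value, which is where the small memory overhead and the simple object graph come from.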
!35
ValueVector & RecordBatch
{
  name: "icecream",
  price: {
    basic: 4.99,
    coupon: true
  }
}

RecordBatch
name:VarChar           i c e c r e a m …
price.basic:float      4.99 …
price.coupon:boolean   T …
!36
ValueVector & RecordBatch
• Data passed in RecordBatch
• Inner loop: next() vs for

[Figure: plan tree with agg over a Join of filter (scan: Dimension) and filter (scan: Fact)]
!37
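The "next() vs for" contrast can be sketched as two versions of the same filter. The function names and sample rows are hypothetical, and Python will not show the speed difference; the point is only the shape of the inner loop: one call per record in the first version, one call per batch followed by a tight loop over a single column in the second.

```python
def filter_record_at_a_time(records, predicate):
    """Record-at-a-time style: the consumer pulls one record per next() call."""
    for record in records:
        if predicate(record):
            yield record

def filter_batches(batches, column, wanted):
    """Batch style: one call per batch, then a plain loop over one column."""
    for batch in batches:
        values = batch[column]
        keep = [i for i in range(len(values)) if values[i] == wanted]
        yield {name: [col[i] for i in keep] for name, col in batch.items()}

rows = [
    {"uid": "user_001", "event": "login"},
    {"uid": "user_002", "event": "login"},
    {"uid": "user_001", "event": "pay"},
    {"uid": "user_003", "event": "login"},
]
# The same rows as two column-oriented batches of two records each.
batches = [
    {"uid": ["user_001", "user_002"], "event": ["login", "login"]},
    {"uid": ["user_001", "user_003"], "event": ["pay", "login"]},
]

row_result = [r["uid"] for r in
              filter_record_at_a_time(iter(rows), lambda r: r["event"] == "login")]
batch_result = [uid for b in filter_batches(batches, "event", "login")
                for uid in b["uid"]]
print(row_result, batch_result)
```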
Review the Considerations
• Cache friendly
• Superscalar: pipeline friendly
• SIMD friendly
• Compact data representation
• JVM friendly (GC)
• JVM friendly (JNI)

[Figure: RecordBatch column vectors name:VarChar, price.basic:float, price.coupon:boolean]
!38
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future

!39
Our work, primarily
• Adhoc batch query

!40
Reports: 2-dimensional tables generally

!41
Adhoc batch query
DailyActiveUser   2013-07-26   2013-07-27
en                576          491
cn                361          945

!42
Adhoc batch query
Fact
user id   event   time
user_13   login   2013-07-26
user_13   login   2013-07-26
user_76   pay     2013-07-27

Dimension
user id   nation
user_13   cn
user_76   en

DAU   2013-07-26   2013-07-27
en    576          491
cn    361          945
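A report like this DAU table can be seen as a batch of queries, one per cell. The sketch below fills the cells from the three sample fact rows using Python's built-in sqlite3 module. It assumes that DAU means the number of distinct users with any event on that day, which the slide does not state; the table names, the column name uid and the helper dau_report are also assumptions. The counts differ from the slide's figures because only the sample rows are loaded.

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table fact (uid text, event text, time text);
create table dim (uid text primary key, nation text);
insert into fact values
  ('user_13', 'login', '2013-07-26'),
  ('user_13', 'login', '2013-07-26'),
  ('user_76', 'pay',   '2013-07-27');
insert into dim values ('user_13', 'cn'), ('user_76', 'en');
""")

# One query per report cell: filter facts by date, dimensions by nation,
# join on uid, then aggregate.
CELL_SQL = """
select count(distinct f.uid)
from fact f join dim d on f.uid = d.uid
where f.time = ? and d.nation = ?
"""

def dau_report(conn, dates, nations):
    """Fill a nation x date report by issuing one query per cell."""
    return {
        (nation, date): conn.execute(CELL_SQL, (date, nation)).fetchone()[0]
        for nation, date in product(nations, dates)
    }

report = dau_report(conn, ["2013-07-26", "2013-07-27"], ["en", "cn"])
print(report)
```

Issued this way, a 2 x 2 report scans the fact table and the dimension table four times each.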

!43
Adhoc batch query
DAU   2013-07-26   2013-07-27
en    576          491
cn    361          945

!44
Adhoc batch query
[Figure: the DAU table with one query plan per cell. Each plan is scan: Fact -> filter on date and scan: Dimension -> filter on nation, then Join -> agg. Cells: en/2013-07-26 = 576, en/2013-07-27 = 491, cn/2013-07-26 = 361, cn/2013-07-27 = 945.]
!45
[Figure: four separate plans, one per combination of date ('2013-07-26', '2013-07-27') and nation ('en', 'cn'). Each plan has its own scan: Fact, scan: Dimension, two filters, Join and agg.]
!46
[Figure: the four plans sharing a single scan: Fact and a single scan: Dimension. The date and nation filters, Joins and aggs are still separate per plan.]
!47
Adhoc batch query
• Benefits
  • Eliminate identical Scans
  • Merge similar Scans
• Possibility
  • SQL usually parses into a Tree, while
  • the LogicalPlan in Drill is a DAG
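Because the plan is a DAG rather than a tree, identical operators from different queries can be shared. One simple way to get that sharing is to key each operator by its type, argument and inputs, and to reuse an existing node whenever the key already exists. The sketch below applies this to the four DAU cell queries; PlanBuilder and add_dau_cell are hypothetical names, and this is not claimed to be how the Xingcloud plan merger is implemented.

```python
class PlanBuilder:
    """Builds operator nodes, returning the existing node for an identical request."""

    def __init__(self):
        self.nodes = {}  # (op, arg, input node ids) -> node id

    def node(self, op, arg=None, inputs=()):
        key = (op, arg, tuple(inputs))
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]

    def count(self, op):
        return sum(1 for key in self.nodes if key[0] == op)

def add_dau_cell(plan, date, nation):
    """Add one cell's plan: agg(join(filter(scan Fact), filter(scan Dimension)))."""
    facts = plan.node("filter", "date='%s'" % date, [plan.node("scan", "Fact")])
    dims = plan.node("filter", "nation='%s'" % nation, [plan.node("scan", "Dimension")])
    joined = plan.node("join", "uid", [facts, dims])
    return plan.node("agg", "count distinct uid", [joined])

plan = PlanBuilder()
roots = [add_dau_cell(plan, date, nation)
         for nation in ("en", "cn")
         for date in ("2013-07-26", "2013-07-27")]

print(plan.count("scan"), plan.count("filter"), plan.count("join"), plan.count("agg"))
```

Four independent tree-shaped plans would contain 8 scans and 8 filters; the shared DAG keeps 2 scans and 4 filters while each cell retains its own Join and agg.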
!48
More Benefits:
Intermediate result reuse

!49
Adhoc batch query
[Figure: merged plan with one scan: Fact and one scan: Dimension. Each of the four date/nation combinations still has its own date filter, nation filter, Join and agg.]
!50
Adhoc batch query
[Figure: identical nation filters merged. One Filter nation='en' and one Filter nation='cn' on the Dimension scan each feed two Joins; the four date filters on the Fact scan are still separate.]
!51
Adhoc batch query
[Figure: identical date filters merged as well. Four shared Filters (date='2013-07-26', date='2013-07-27', nation='en', nation='cn') feed four Joins and four aggs.]
!52
More Benefits:
More Batched,
More Offline

!53
Single Query
!54
Batched 3 Queries
!55
Batched Query, from a report
!56
Batched Query, from tens of reports, with 1k+ operators
!57
Jobs vs Predictions
• Offline job
  • becomes a prediction of what data users may be interested in
  • by merging more queries together
  • daily predictions & hourly predictions

!58
More Benefits:
Utilising multi-core

!59
Utilising Multi-core
• Original:
  • Pull data from root
  • Downwards recursively

[Figure: plan tree with agg over a Join of filter nation='en' (scan: Dimension) and filter date='2013-07-26' (scan: Fact)]
!60
Utilising Multi-core
• Now:
  • Push data from Leaf
  • Data driven upwards
  • Pooled execution

[Figure: plan tree with agg over a Join of filter nation='en' (scan: Dimension) and filter date='2013-07-26' (scan: Fact)]
!61
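The difference between the two execution models can be sketched with a tiny scan -> filter -> count plan. In the pull version the root iterates and its children produce values on demand; in the push version each scan runs as a task on a thread pool and drives its batches upwards through push() calls. The class and function names are hypothetical, and how the real scheduler pools and orders operators is an assumption here. Python threads also do not give real CPU parallelism, so the sketch only shows the control flow.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CountSink:
    """Root operator: receives pushed batches and accumulates a count."""

    def __init__(self):
        self.total = 0
        self._lock = threading.Lock()  # several scan tasks may push concurrently

    def push(self, batch):
        with self._lock:
            self.total += len(batch)

class Filter:
    """Receives a batch, keeps the matching values and pushes them upwards."""

    def __init__(self, predicate, parent):
        self.predicate = predicate
        self.parent = parent

    def push(self, batch):
        self.parent.push([v for v in batch if self.predicate(v)])

def scan(batches, parent):
    """Leaf operator: drives execution by pushing each batch to its parent."""
    for batch in batches:
        parent.push(batch)

def pull_count(batches, predicate):
    """Pull model for comparison: the root iterates, children yield on demand."""
    filtered = (v for batch in batches for v in batch if predicate(v))
    return sum(1 for _ in filtered)

def is_login(event):
    return event == "login"

# Two partitions of the fact data, each split into small batches.
partitions = [[["login", "login"], ["pay"]], [["login", "pay"], ["login"]]]

sink = CountSink()
with ThreadPoolExecutor(max_workers=2) as pool:
    # One scan task per partition; every task pushes into the same sink.
    futures = [pool.submit(scan, part, Filter(is_login, sink)) for part in partitions]
    for future in futures:
        future.result()  # re-raise any exception from a worker

pushed = sink.total
pulled = sum(pull_count(part, is_login) for part in partitions)
print(pushed, pulled)
```

In the pull model a single thread walks the whole plan from the root, while in the push model every leaf is an independent unit of work, which is what lets a single plan use several cores.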
Adhoc batch query
• Benefits
  • Eliminate identical Scans
  • Merge similar Scans
  • Merge intermediate operators
  • Unified process for adhoc & batch process
  • Multi-core process of single Plan
!62
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future

!63
About Xingcloud
• Now
  • http://a.xingcloud.com
  • 2 billion insert/update daily
  • 200k+ aggregation data/day, 6k sec in total
  • query response time: <1sec - 100 sec, 10 sec on avg.
• Future
  • Plan Merge
  • Unified process for batch, adhoc & stream process, SQL oriented
  • SQL(t): Plan with time window
!64
About Drill
• Now
  • on Parquet/ORCFile on HDFS
  • Distributed Join
  • Write interface of storage engines
• Future
  • 1.0 M2: December 2013
  • 1.0 GA: Early 2014
  • more detail on https://issues.apache.org/jira/browse/DRILL
!65
References
• http://incubator.apache.org/drill/index.html#resources
• http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
• http://prezi.com/j43vb1umlgqv/timothy-chen/
• http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memoryefficient-java-tutorial.pdf
• http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

!66
Q&A

!67

穆黎森:Interactive batch query at scale