Simple Strategies for faster knowledge discovery in big data

Dr. Ritesh Agrawal, Lead Data Scientist
Zheng Shao, Staff Software Engineer

* 6 continents, 70+ countries, 450 cities
* The first billion trips took 5 years; the second billion took 1.5 years
* 5+ million trips every day

Infrastructure Challenge
* Help discover knowledge
* Quickly: low query latency
* Efficiently: minimal cost

How to Build an Effective Infrastructure
* Data, e.g. log Presto queries
* Analytics, e.g. 20% of compute resources are used by queries that time out
* Optimization, e.g. build a query gate
* Forecasting, e.g. predict future requirements

Data
Presto, Hive, HDFS, Kafka
* User
* Tables
* Join Clause
* Filter Clause
* Execution-Related Metrics

Overview Analytics
* Most queried tables
* Most expensive queries
* Top N users
* …

Detailed Analytics
* Filter Clause
* Most Joined tables
* Hot & Cold Partitions…
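
The overview metrics (most queried tables, most expensive queries, top N users) can be sketched as simple aggregations over a query log. This is a minimal illustration, not the deck's implementation: the record layout, the field names (`user`, `tables`, `cpu_seconds`) and the sample values are all assumptions.

```python
from collections import Counter

# Hypothetical query-log records; field names and values are assumptions.
query_log = [
    {"user": "alice", "tables": ["trips", "cities"], "cpu_seconds": 120.0},
    {"user": "bob", "tables": ["trips"], "cpu_seconds": 900.0},
    {"user": "alice", "tables": ["drivers"], "cpu_seconds": 30.0},
    {"user": "carol", "tables": ["trips", "drivers"], "cpu_seconds": 45.0},
]


def most_queried_tables(log, n):
    """Return the n tables referenced by the most queries, with counts."""
    # set() so a table mentioned twice in one query is counted once.
    counts = Counter(t for record in log for t in set(record["tables"]))
    return counts.most_common(n)


def top_users(log, n):
    """Return the n users who issued the most queries, with counts."""
    return Counter(record["user"] for record in log).most_common(n)


def most_expensive_queries(log, n):
    """Return the n log records with the highest CPU time."""
    return sorted(log, key=lambda r: r["cpu_seconds"], reverse=True)[:n]


print(most_queried_tables(query_log, 2))
print(top_users(query_log, 1))
print(most_expensive_queries(query_log, 1))
```

At real scale the same aggregations would more likely run as queries over the logged data; the in-memory version only shows the shape of the computation.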

[Chart: percent of queries by statement type (select, insert, create, with, others)]
[Chart: percent of total queries and failure percent in bucket, by number of joins (0, 1, 2, 3, 4, 5+)]

90% of compute resources are consumed by the 10% most expensive queries
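
A concentration figure like this one can be computed from per-query resource costs. The sketch below is one assumed way to do it; the function name `top_share` and the sample costs are hypothetical and not taken from the deck.

```python
def top_share(costs, top_fraction=0.10):
    """Fraction of total cost consumed by the most expensive queries.

    `top_fraction` selects how many queries count as "most expensive"
    (0.10 means the top 10%, rounded, with at least one query).
    """
    if not costs:
        raise ValueError("costs must not be empty")
    ordered = sorted(costs, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Hypothetical costs: one heavy query and nine light ones.
costs = [900.0] + [10.0] * 9
share = top_share(costs)
print(f"top 10% of queries consume {share:.0%} of resources")
```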

* 90% of queries using [table A] filter on [column X]
* 70% of queries using [table A] join to [table B] on column X
* 90% of queries using [table A, table B] end up using columns X, Y from table A and column Z from table B
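
Statistics of this kind can be derived once each logged query carries the tables it touched and the columns it filtered on. The following is a minimal sketch under that assumption; the field names `tables` and `filter_columns` and the sample records are hypothetical.

```python
def pct_filtering_on(log, table, column):
    """Percent of queries touching `table` that filter on `table.column`."""
    using = [r for r in log if table in r["tables"]]
    if not using:
        return 0.0
    hits = sum(1 for r in using if (table, column) in r["filter_columns"])
    return 100.0 * hits / len(using)


# Hypothetical parsed queries; filter_columns holds (table, column) pairs.
parsed_log = [
    {"tables": {"table_a"}, "filter_columns": {("table_a", "x")}},
    {"tables": {"table_a", "table_b"}, "filter_columns": {("table_a", "x")}},
    {"tables": {"table_a"}, "filter_columns": {("table_a", "y")}},
    {"tables": {"table_a"}, "filter_columns": {("table_a", "x"), ("table_a", "y")}},
    {"tables": {"table_b"}, "filter_columns": set()},
]

print(pct_filtering_on(parsed_log, "table_a", "x"))
```

The join and column-usage percentages follow the same pattern: restrict to queries that touch the table, then count those that match the condition.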

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[
CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS
]
[
SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
]
[
[ROW FORMAT row_format]
[STORED AS file_format]
]

* A query that used to take 10 minutes now completes within 2 minutes: an 80% reduction in elapsed time.
* 95% reduction in resource consumption.

Partitioned By: Key Considerations
* Usage:
  * Identify fields that are often used for filtering (FILTER CLAUSE).
* Cardinality:
  * Keep cardinality low; otherwise the number of partitions explodes the metastore.
* Skewness:
  * Data should be evenly distributed if all the keys are popular.
  * Skewness is okay if the filter value corresponds to a smaller dataset.
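
The cardinality and skewness checks above can be approximated by profiling a sample of a candidate partition column. This is a sketch only: `partition_profile`, the `skew_ratio` definition (largest partition divided by mean partition size) and the sample values are assumptions, and any thresholds for "low cardinality" or "too skewed" are left to the reader.

```python
from collections import Counter


def partition_profile(values):
    """Summarize a candidate partition column from a sample of its values.

    cardinality: number of distinct values, i.e. partitions that would exist.
    skew_ratio:  rows in the largest partition divided by the mean partition
                 size; 1.0 means perfectly even.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    cardinality = len(counts)
    mean_size = len(values) / cardinality
    return {
        "cardinality": cardinality,
        "skew_ratio": max(counts.values()) / mean_size,
    }


# Hypothetical sample of a city column.
sample = ["sf"] * 6 + ["nyc"] * 2 + ["la"] * 2
profile = partition_profile(sample)
print(profile)
```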

Clustered By/Bucketing: Key Considerations
* Identify tables that are often joined together.
* Cluster tables on the join key.
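
Why clustering both tables on the join key helps can be shown with a toy model: rows with equal keys always land in the same bucket number, so a join only needs to compare bucket i of one table with bucket i of the other. The sketch below is illustrative only. It uses CRC32 as a stand-in hash (not the hash function Hive uses), and the table and column names are hypothetical.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a key to a bucket number. CRC32 is a stand-in, not Hive's hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key_field, num_buckets):
    """Split rows into num_buckets lists according to the hashed key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_field], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, key_field, num_buckets):
    """Inner-join two row lists by joining only matching bucket pairs."""
    joined = []
    pairs = zip(bucketize(left, key_field, num_buckets),
                bucketize(right, key_field, num_buckets))
    for left_bucket, right_bucket in pairs:
        # Index the right bucket by key, then probe it with the left bucket.
        index = {}
        for row in right_bucket:
            index.setdefault(row[key_field], []).append(row)
        for row in left_bucket:
            for match in index.get(row[key_field], []):
                joined.append({**match, **row})
    return joined


# Hypothetical tables sharing the join key city_id.
trips = [{"trip_id": 1, "city_id": 10}, {"trip_id": 2, "city_id": 20},
         {"trip_id": 3, "city_id": 10}]
cities = [{"city_id": 10, "name": "SF"}, {"city_id": 20, "name": "NYC"}]

result = bucketed_join(trips, cities, "city_id", num_buckets=4)
print(sorted((r["trip_id"], r["name"]) for r in result))
```

Each bucket pair can be joined independently, which is what lets an engine avoid shuffling the full tables when both are bucketed the same way on the join key.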

Storage
* Row vs. columnar.
* Significant improvement by moving to ORC or Parquet format.

SKEWED BY
* Splits data so that heavy values are stored in separate files.
* Helps with query optimization.

Other techniques
* Execution engine: MapReduce vs. Tez
* Vectorization
* Pre-joined tables
Key Takeaways
* Fast and efficient infrastructure is key to a business's success.
* Infrastructure optimization is an ongoing process.
* Infrastructure data science is key to building an efficient infrastructure.
* Start simple.

Thank you
Ritesh Agrawal, Lead Data Scientist, Uber Inc.
