Simple Strategies For
Faster Knowledge
Discovery In Big Data
Dr. Ritesh Agrawal
Lead Data Scientist
Zheng Shao
Staff Software Engineer
* 6 continents, 70+ countries, 450 cities
* The first billion trips took 5 years; the second billion took 1.5 years
* 5+ million trips every day
Data-Driven Culture
Infrastructure Challenge
*Help discover
*Quickly: having low query latency
*Efficiently: minimal cost
How to Build An Effective Infrastructure
* Data (eg: log Presto queries)
* Analytics (eg: 20% of compute resources are used by timed-out queries)
* Optimization (eg: build a query gate)
* Forecasting (eg: predict future requirements)
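The analytics example above (share of compute used by timed-out queries) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `QueryRun` record and its fields (`cpu_seconds`, `timed_out`) are hypothetical stand-ins for whatever the query log actually records.

```python
from dataclasses import dataclass


@dataclass
class QueryRun:
    # Hypothetical log fields; the real log schema is not given in the slides.
    query_id: str
    cpu_seconds: float
    timed_out: bool


def timeout_compute_share(runs):
    """Fraction of total CPU time spent on queries that ended in a timeout."""
    total = sum(r.cpu_seconds for r in runs)
    if total == 0:
        return 0.0
    wasted = sum(r.cpu_seconds for r in runs if r.timed_out)
    return wasted / total


runs = [
    QueryRun("q1", 50.0, False),
    QueryRun("q2", 20.0, True),
    QueryRun("q3", 30.0, False),
]
share = timeout_compute_share(runs)
print(f"{share:.0%} of compute went to timed-out queries")
```

A figure like this is what motivates the "query gate" step: if a noticeable share of compute produces no result, rejecting or rewriting such queries up front is worth building.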
PRESTO HIVE
HDFS
KAFKA
*User
*Tables
*Join Clause
*Filter Clause
*Execution Related Metrics
Data Analytics Optimization Forecasting
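One possible shape for a logged query record covering the items listed above (user, tables, join clause, filter clause, execution-related metrics) is sketched below. The class name, field names, and JSON encoding are assumptions for illustration; the slides do not describe the actual log format.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class QueryLogRecord:
    # All field names are hypothetical; they mirror the bullet list on the slide.
    user: str
    tables: list
    join_clauses: list = field(default_factory=list)
    filter_clauses: list = field(default_factory=list)
    elapsed_seconds: float = 0.0   # execution-related metric
    cpu_seconds: float = 0.0       # execution-related metric
    status: str = "finished"       # e.g. "finished", "failed", "timeout"

    def to_json(self):
        """Serialize to one JSON line, suitable for appending to a log stream."""
        return json.dumps(asdict(self), sort_keys=True)


record = QueryLogRecord(
    user="analyst_1",
    tables=["db.trips", "db.cities"],
    join_clauses=["db.trips.city_id = db.cities.id"],
    filter_clauses=["db.trips.date = 20170102"],
    elapsed_seconds=12.5,
    cpu_seconds=340.0,
)
line = record.to_json()
print(line)
```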
Overview Analytics
*Most queried tables
*Most expensive queries
*Top N users
*…
Detailed Analytics
*Filter Clause
*Most Joined tables
*Hot & Cold Partitions…
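The overview analytics listed above reduce to simple counting over the query log. A minimal sketch, assuming each log entry is a dict with hypothetical keys `user`, `tables`, and `cpu_seconds`:

```python
from collections import Counter

# Toy query log; the keys are illustrative, not the real schema.
query_log = [
    {"user": "alice", "tables": ["trips", "cities"], "cpu_seconds": 120.0},
    {"user": "bob", "tables": ["trips"], "cpu_seconds": 900.0},
    {"user": "alice", "tables": ["trips", "drivers"], "cpu_seconds": 30.0},
    {"user": "carol", "tables": ["cities"], "cpu_seconds": 5.0},
]


def most_queried_tables(log, n):
    """Count each table once per query that references it."""
    counts = Counter(t for q in log for t in set(q["tables"]))
    return counts.most_common(n)


def top_users_by_cpu(log, n):
    """Rank users by the total CPU time of their queries."""
    usage = Counter()
    for q in log:
        usage[q["user"]] += q["cpu_seconds"]
    return usage.most_common(n)


def most_expensive_queries(log, n):
    """Return the n queries with the highest CPU time."""
    return sorted(log, key=lambda q: q["cpu_seconds"], reverse=True)[:n]


print(most_queried_tables(query_log, 2))
print(top_users_by_cpu(query_log, 1))
```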
[Figure: percent of queries by statement type (select, insert, create, with, others)]
[Figure: percent of total queries and failure percent in bucket, by number of joins (0, 1, 2, 3, 4, 5+)]
90% of compute resources are consumed by the 10% most expensive queries
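A concentration figure like the one above can be computed by ranking queries by cost and summing the top slice. The sketch below assumes a plain list of per-query costs (for instance CPU seconds); the function name and the toy numbers are illustrative only.

```python
def top_share(costs, top_percent=10):
    """Share of total cost attributable to the most expensive top_percent of queries."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Integer arithmetic avoids floating-point surprises when sizing the slice.
    k = max(1, len(ranked) * top_percent // 100)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Ten queries: one very expensive, nine cheap (total cost 1000).
costs = [900, 20, 20, 10, 10, 10, 10, 10, 5, 5]
print(f"top 10% of queries use {top_share(costs):.0%} of compute")
```

When the share is this concentrated, optimizing the data layout for a handful of queries can recover most of the cluster's capacity.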
* 90% of queries using [table A] filter on [column X]
* 70% of queries using [table A] join to [table B] on [column X]
* 90% of queries using [table A, table B] end up using columns X, Y from table A and column Z from table B
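Statements of the form "N% of queries using a table filter on a column" can be derived from the logged filter clauses. A minimal sketch, assuming each log entry carries a hypothetical `filters` mapping from table name to the columns it was filtered on:

```python
from collections import Counter


def filter_column_share(log, table):
    """For queries touching `table`, the fraction that filter on each of its columns."""
    touching = [q for q in log if table in q["tables"]]
    if not touching:
        return {}
    counts = Counter()
    for q in touching:
        # set() so a column used twice in one query is counted once.
        for col in set(q["filters"].get(table, [])):
            counts[col] += 1
    return {col: n / len(touching) for col, n in counts.items()}


# Toy log: nine queries filter table_a on column x, one does not filter at all.
log = [{"tables": ["table_a"], "filters": {"table_a": ["x"]}} for _ in range(9)]
log.append({"tables": ["table_a", "table_b"], "filters": {}})

print(filter_column_share(log, "table_a"))
```

A column with a high share is a candidate partition key; the same counting applied to join clauses yields candidate bucketing keys.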
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[
CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS
]
[
SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
]
[
[ROW FORMAT row_format]
[STORED AS file_format]
]
Optimization > Partitioned By
|—user
  |—db
    |—table
      |—date=20170101
        |—file_00000
        |—file_00001
      |—date=20170102
        |—file_00000
        |—file_00001
      |—date=20170103
        |—file_00000
        |—file_00001
SELECT *
FROM [table]
WHERE
  date = 20170102
  AND event_type = 'click'
Optimization > Partitioned By
|—user
  |—db
    |—table
      |—date=20170101
        |—event_type='click'
          |—file_00000
          |—file_00001
      |—date=20170102
        |—event_type='click'
          |—file_00000
          |—file_00001
        |—event_type='map'
          |—file_00000
          |—file_00001
      |—….
SELECT *
FROM [table]
WHERE
  date = 20170102
  AND event_type = 'click'
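The effect of the extra partition level on the query above can be illustrated with a toy model of partition pruning. This is a simplified sketch, not how Hive or Presto implement it: partitions are represented as path strings, and `parse_partition` and `prune` are hypothetical helper names.

```python
def parse_partition(path):
    """Turn 'date=20170102/event_type=click' into {'date': '20170102', 'event_type': 'click'}."""
    return dict(part.split("=", 1) for part in path.split("/"))


def prune(partitions, predicates):
    """Keep partitions that match every equality predicate on a partition column.

    A predicate on a column that is not a partition key cannot prune anything,
    so it is ignored here and would have to be applied row by row at scan time.
    """
    kept = []
    for path in partitions:
        values = parse_partition(path)
        if all(values.get(col, want) == want for col, want in predicates.items()):
            kept.append(path)
    return kept


predicates = {"date": "20170102", "event_type": "click"}

by_date = ["date=20170101", "date=20170102", "date=20170103"]
by_date_and_type = [
    "date=20170101/event_type=click",
    "date=20170102/event_type=click",
    "date=20170102/event_type=map",
]

# Partitioned by date only: the whole day is scanned, clicks and all other events.
print(prune(by_date, predicates))
# Partitioned by date and event_type: only the click files for that day are read.
print(prune(by_date_and_type, predicates))
```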
Optimization > Partitioned By
* A query that used to take 10 minutes now completes within 2 minutes: an 80% reduction in elapsed time.
* 95% reduction in resource consumption.
Partitioned By: Key Considerations
* Usage:
  * Identify fields that are often used for filtering (FILTER CLAUSE)
* Cardinality:
  * Keep cardinality low; otherwise the number of partitions explodes the metastore
* Skewness:
  * Data should be evenly distributed if all the keys are popular
  * Skewness is okay if the filter value corresponds to a smaller dataset
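The cardinality and skewness considerations above can be checked on a sample of a candidate column before repartitioning. A rough sketch: the function name, the `max_partitions` threshold of 1000, and the skew measure (largest partition divided by mean partition size) are all assumptions chosen for illustration, not recommendations from the talk.

```python
from collections import Counter


def assess_partition_key(values, max_partitions=1000):
    """Summarize how a column would behave as a partition key."""
    if not values:
        raise ValueError("need at least one sampled value")
    counts = Counter(values)
    cardinality = len(counts)
    mean_size = len(values) / cardinality
    return {
        "cardinality": cardinality,
        # Hypothetical threshold: too many distinct values means too many partitions.
        "low_cardinality": cardinality <= max_partitions,
        # 1.0 means perfectly even; larger values mean one key dominates.
        "skew_ratio": max(counts.values()) / mean_size,
    }


sample = ["click"] * 6 + ["map"] * 3 + ["view"] * 3
print(assess_partition_key(sample))
```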
Optimization > Clustered By/Bucketing
Clustered By/Bucketing: Key Considerations
* Identify tables that are often joined together.
* Cluster tables on join key.
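Why clustering both tables on the join key helps can be shown with a small simulation: rows with the same key always land in the same bucket number, so bucket i of one table only needs to be joined with bucket i of the other. This is a sketch of the idea only; `zlib.crc32` stands in for the engine's own hash function, and all function names are hypothetical.

```python
import zlib


def bucket_for(key, num_buckets):
    """Assign a key to a bucket (crc32 is a stand-in for the engine's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key, num_buckets):
    """Split rows into num_buckets lists according to the hash of the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, key, num_buckets):
    """Join bucket i of left only with bucket i of right."""
    joined = []
    left_buckets = bucketize(left, key, num_buckets)
    right_buckets = bucketize(right, key, num_buckets)
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        index = {}
        for rrow in right_bucket:
            index.setdefault(rrow[key], []).append(rrow)
        for lrow in left_bucket:
            for rrow in index.get(lrow[key], []):
                joined.append({**lrow, **rrow})
    return joined


trips = [
    {"city_id": 1, "trip": "t1"},
    {"city_id": 2, "trip": "t2"},
    {"city_id": 1, "trip": "t3"},
]
cities = [{"city_id": 1, "name": "SF"}, {"city_id": 2, "name": "NYC"}]

result = bucketed_join(trips, cities, "city_id", num_buckets=4)
print(sorted((r["trip"], r["name"]) for r in result))
```

Both tables must use the same key and the same number of buckets for the pairing to be valid; in a real table the bucketing is done once at write time rather than on every join as in this toy version.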
Storage
* Row vs Col.
* Significant improvement by moving to ORC or Parquet format.
SKEWED BY
* Splits data so that heavy values are stored in separate files
* Helps with query optimization.
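The idea behind SKEWED BY, as described above, is that a few heavy values get their own files while everything else shares a common one. The sketch below only mimics that split in memory; the function name, the file labels, and the `"other"` catch-all are hypothetical and do not reflect the engine's actual on-disk layout.

```python
from collections import defaultdict


def split_skewed(rows, column, skewed_values):
    """Route rows with a heavy value to a dedicated 'file', the rest to a shared one."""
    files = defaultdict(list)
    for row in rows:
        value = row[column]
        target = f"{column}={value}" if value in skewed_values else "other"
        files[target].append(row)
    return dict(files)


events = (
    [{"event_type": "click", "id": i} for i in range(8)]
    + [{"event_type": "map", "id": 100}, {"event_type": "view", "id": 101}]
)

files = split_skewed(events, "event_type", skewed_values={"click"})
for name, rows in files.items():
    print(name, len(rows))
```

With this layout, a query filtering on a rare value never has to read the file dominated by the heavy value, and a query on the heavy value reads only its dedicated file.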
Other techniques
* Execution engine: MapReduce vs Tez
* Vectorization
* Pre-joined tables
Key Takeaways
* Fast and efficient infrastructure is key to a business’s success.
* Infrastructure optimization is a constantly ongoing process.
* Infrastructure data science is key to building an efficient infrastructure.
* Start simple.
Ritesh Agrawal,
Lead Data Scientist, Uber Inc.
Thank you
