Zhenxiao Luo
Software Engineer @ Uber
Even Faster: When Presto Meets Parquet @ Uber
Agenda
● Mission
● Uber Business Highlights
● Analytics Infrastructure @ Uber
● Presto: Interactive SQL engine for Big Data
● Parquet: Columnar Storage for Big Data
● Parquet Optimizations for Presto
● Ongoing Work
Uber Mission
Transportation as reliable as running water, everywhere, for everyone
Uber Stats
● 6 continents
● 73 countries
● 450 cities
● 12,000 employees
● 10+ million average trips per day
● 40+ million monthly active riders (MAU)
● 1.5+ million monthly active drivers (MAU)
Analytics Infrastructure @ Uber
[Architecture diagram. Labeled components: Kafka, Schemaless, MySQL, Postgres, Vertica, Streamio, Sqoop, Hadoop (raw data, raw tables, modeled tables), Hive, Presto, Spark, Samza, Pinot, Flink, MemSQL. Labeled consumers: notebooks, ad hoc queries, reports, real-time applications, machine learning jobs, business intelligence jobs. Labeled platform concerns: cluster management, all-active, observability, security. Labeled tiers: streaming, warehouse, real-time.]
Parquet @ Uber
Raw Tables
● No preprocessing
● Highly nested
● ~30 minutes ingestion latency
● Huge tables
Modeled Tables
● Preprocessing via Hive ETL
● Flattened
● ~12 hours ingestion latency
Scale of Presto @ Uber
● 2 clusters
○ Application cluster
■ Hundreds of machines
■ 100K queries per day
■ P90: 30s
○ Ad hoc cluster
■ Hundreds of machines
■ 20K queries per day
■ P90: 60s
● Access to both raw and modeled tables
○ 5 petabytes of data
● Total 120K+ queries per day
Applications of Presto @ Uber
● Marketplace pricing
○ Real-time driver incentives
● Communication platform
○ Driver quality and action platform
○ Rider/driver cohorting
○ Ops, comms, & marketing
● Growth marketing
○ BI dashboard for growth marketing
● Data science
○ Exploratory analytics using notebooks
● Data quality
○ Freshness and quality check
● Ad hoc queries
What is Presto: Interactive SQL Engine for Big Data
● Interactive query speeds
● Horizontally scalable
● ANSI SQL
● Battle-tested by Facebook, Uber, & Netflix
● Completely open source
● Access to petabytes of data in the Hadoop data lake
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
○ Inline virtual function calls
○ Inline constants
○ Rewrite inner loops
○ Rewrite type-specific branches
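A rough sketch of what "inline constants" and "rewrite type-specific branches" can mean in practice. Presto generates JVM bytecode; this Python analogue only mimics the idea by generating specialized source for one predicate. All names here (`interpret_filter`, `compile_filter`, the sample rows) are hypothetical and not taken from Presto.

```python
# Illustration only: Presto emits JVM bytecode, not Python source.

def interpret_filter(rows, column, op, constant):
    """Generic path: re-dispatches on `op` for every single row."""
    out = []
    for row in rows:
        value = row[column]
        if op == "eq":
            keep = value == constant
        elif op == "lt":
            keep = value < constant
        else:
            raise ValueError(op)
        if keep:
            out.append(row)
    return out


def compile_filter(column, op, constant):
    """Specialized path: the operator and constant are baked into the code."""
    symbol = {"eq": "==", "lt": "<"}[op]
    source = (
        "def generated_filter(rows):\n"
        f"    return [row for row in rows if row[{column!r}] {symbol} {constant!r}]\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source once per query
    return namespace["generated_filter"]


rows = [
    {"city_id": 12, "driver_uuid": "a"},
    {"city_id": 7, "driver_uuid": "b"},
]
fast_filter = compile_filter("city_id", "eq", 12)
```

The generated function has no per-row branch on the operator, which is the effect the slide attributes to bytecode generation.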
Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
○ Short-running queries get higher priority
● Memory Management
○ Max memory per query per node
○ If query exceeds max memory limit, query fails
○ No OutOfMemory in Presto process
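The memory rule above (a per-query, per-node cap, with the query failing instead of the process running out of memory) could be sketched as follows. This is an assumed, simplified model; `QueryMemoryContext` and `ExceededMemoryLimitError` are hypothetical names, not Presto classes.

```python
class ExceededMemoryLimitError(Exception):
    """Fails a single query; the worker process itself keeps running."""


class QueryMemoryContext:
    """Tracks bytes reserved by one query on one node (hypothetical model)."""

    def __init__(self, query_id, max_bytes_per_node):
        self.query_id = query_id
        self.max_bytes = max_bytes_per_node
        self.reserved = 0

    def reserve(self, num_bytes):
        # Check before allocating, so the limit is enforced by bookkeeping
        # instead of by the runtime throwing an out-of-memory error.
        if self.reserved + num_bytes > self.max_bytes:
            raise ExceededMemoryLimitError(
                f"query {self.query_id} exceeded {self.max_bytes} bytes on this node"
            )
        self.reserved += num_bytes

    def free(self, num_bytes):
        self.reserved = max(0, self.reserved - num_bytes)


context = QueryMemoryContext("q1", max_bytes_per_node=1000)
context.reserve(600)
try:
    context.reserve(600)  # would bring the total to 1200 > 1000
    query_failed = False
except ExceededMemoryLimitError:
    query_failed = True
```

Only the offending query sees the exception; other queries on the same node keep their own contexts and continue.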
Limitations
● No fault tolerance
● Joins must fit in memory
○ If a join does not fit, the query fails
○ No OutOfMemory in Presto process
○ Run such queries on Hive instead
● Coordinator is a single point of failure
Presto Connectors
Parquet: Columnar Storage for Big Data
Parquet Optimizations for Presto
Example Query:
SELECT base.driver_uuid
FROM hdrone.mezzanine_trips
WHERE datestr = '2017-03-02'
  AND base.city_id IN (12)
Data:
● Up to 15 levels of Nesting
● Up to 80 fields inside each Struct
● Fields are added/deleted/updated inside Struct
Old Parquet Reader
Nested Column Pruning
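A minimal sketch of the general idea behind nested column pruning, assuming a made-up schema: when a query touches only `base.driver_uuid` and `base.city_id`, the reader needs just those leaf columns and not every field of the `base` struct. The schema, `leaf_paths`, and `prune` are hypothetical and do not reflect Uber's tables or Presto's reader code.

```python
def leaf_paths(schema, prefix=()):
    """Yield the dotted path of every primitive leaf in a nested schema."""
    for name, field_type in schema.items():
        path = prefix + (name,)
        if isinstance(field_type, dict):  # a struct: recurse into its fields
            yield from leaf_paths(field_type, path)
        else:
            yield ".".join(path)


def prune(schema, required):
    """Keep only leaves that equal a required path or sit beneath one."""
    return [
        leaf
        for leaf in leaf_paths(schema)
        if any(leaf == r or leaf.startswith(r + ".") for r in required)
    ]


# Hypothetical, heavily reduced schema for illustration.
schema = {
    "datestr": "string",
    "base": {
        "driver_uuid": "string",
        "city_id": "int",
        "fare": {"amount": "double", "currency": "string"},
    },
}

unpruned = prune(schema, ["base"])  # reading the whole struct
pruned = prune(schema, ["base.driver_uuid", "base.city_id"])
```

Here `unpruned` lists four leaf columns and `pruned` lists two; with up to 80 fields per struct the gap would be far larger.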
Columnar Reads
Predicate Pushdown
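As a hedged illustration of predicate pushdown in general: Parquet stores min/max statistics per column chunk in each row group, so a reader can skip row groups whose range cannot contain the filter value. The `RowGroup` class and the data below are hypothetical; real readers work on file metadata, not Python lists.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_city_id: int  # statistics kept alongside the data
    max_city_id: int
    rows: list


def may_contain(group, value):
    """Statistics can only rule a row group out, never confirm a match."""
    return group.min_city_id <= value <= group.max_city_id


def scan(groups, city_id):
    """Return matching rows and how many row groups were actually read."""
    groups_read = 0
    matches = []
    for group in groups:
        if not may_contain(group, city_id):
            continue  # skipped without reading any rows
        groups_read += 1
        matches.extend(r for r in group.rows if r["city_id"] == city_id)
    return matches, groups_read


groups = [
    RowGroup(1, 5, [{"city_id": 1, "driver_uuid": "x"}, {"city_id": 5, "driver_uuid": "y"}]),
    RowGroup(10, 20, [{"city_id": 12, "driver_uuid": "a"}, {"city_id": 15, "driver_uuid": "b"}]),
    RowGroup(30, 40, [{"city_id": 30, "driver_uuid": "z"}]),
]
matches, groups_read = scan(groups, 12)
```

Only the middle row group overlaps `city_id = 12`, so the other two are never read.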
Dictionary Pushdown
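A sketch of dictionary pushdown under simplifying assumptions: a dictionary-encoded column chunk stores its distinct values once, so a reader can check whether the filter value appears in the dictionary before decoding anything. This catches cases min/max statistics miss, such as a chunk holding only 10 and 20 when the filter is 12. The chunk layout and function names are hypothetical.

```python
def decode(chunk):
    """Expand dictionary indices back into values."""
    return [chunk["dictionary"][i] for i in chunk["indices"]]


def count_matches(chunks, value):
    """Return the match count and how many chunks had to be decoded."""
    chunks_decoded = 0
    matches = 0
    for chunk in chunks:
        if value not in chunk["dictionary"]:
            continue  # value cannot occur in this chunk; skip decoding
        chunks_decoded += 1
        matches += sum(1 for v in decode(chunk) if v == value)
    return matches, chunks_decoded


# Hypothetical city_id column chunks.
chunks = [
    {"dictionary": [10, 20], "indices": [0, 1, 1, 0]},  # range 10..20 covers 12, dictionary does not
    {"dictionary": [12, 15], "indices": [0, 0, 1]},
]
result = count_matches(chunks, 12)
```

The first chunk would pass a min/max check for 12 but is still skipped, because 12 is not among its dictionary entries.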
Lazy Reads
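One way to picture lazy reads, as an assumed simplification: the filter column is evaluated first, and the projected column is loaded only if at least one row passes the filter. `LazyColumn` and `select_driver_uuids` are hypothetical names for illustration.

```python
class LazyColumn:
    """Defers loading a column until its values are first requested."""

    def __init__(self, loader):
        self._loader = loader
        self._values = None
        self.loaded = False

    def get(self):
        if not self.loaded:
            self._values = self._loader()  # the expensive read happens here
            self.loaded = True
        return self._values


def select_driver_uuids(city_ids, driver_uuid_column, wanted_city):
    """Filter on city_id first; touch driver_uuid only for matching rows."""
    positions = [i for i, c in enumerate(city_ids) if c == wanted_city]
    if not positions:
        return []  # the projected column is never loaded
    values = driver_uuid_column.get()
    return [values[i] for i in positions]


skipped = LazyColumn(lambda: ["a", "b", "c"])
no_rows = select_driver_uuids([7, 8, 9], skipped, 12)

used = LazyColumn(lambda: ["a", "b", "c"])
some_rows = select_driver_uuids([12, 7, 12], used, 12)
```

In the first call no `city_id` equals 12, so `skipped.loaded` stays `False` and the read of `driver_uuid` is avoided entirely.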
Benchmarking Results
Ongoing Work
● Multi-tenancy support
● High availability for coordinator
● Geospatial optimization
● Authentication & authorization
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be
reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the
use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise
exempt from disclosure under applicable law. All recipients of this document are notified that the information contained
herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any
way disclose this document or any of the enclosed information to any person other than employees of addressee to the
extent necessary for consultations with authorized personnel of Uber.
We are Hiring
https://www.uber.com/careers/list/27366/
Send resumes to:
abhik@uber.com or luoz@uber.com
Interested in learning more about Uber Eng?
Eng.uber.com
Follow us on Twitter:
@UberEng

Presto Apache BigData 2017
