Presto @ Netflix: Interactive Queries
at Petabyte Scale
Nezih Yigitbasi and Zhenxiao Luo
Big Data Platform
Outline
» Big data platform @ Netflix
» Why we love Presto?
» Our contributions
» What are we working on?
» What else we need?
Our Data Pipeline
[Diagram: event data flows from cloud apps through Suro and Ursula to S3 (15m); dimension data flows from Cassandra SSTables through Aegisthus to S3 (daily)]
Big Data Platform Architecture
[Diagram labels: Clients, Gateways, Tools, Service, Clusters (Prod, VPC, Query, Prod, Test, Bonus, Prod), Data Warehouse]
Our Use Cases
» Batch jobs (Pig, Hive)
  » ETL jobs
  » reporting and other analysis
» Ad-hoc queries
  » interactive data exploration
» Looked at Impala, Redshift, Spark, and Presto
Presto @ Netflix
Deployment
» v 0.86
» 1 coordinator (r3.4xlarge)
» 250 workers (m2.4xlarge)
Numbers
» ~2.5K queries/day against our 10PB Hive DW on S3
» 230+ Presto users out of 300+ platform users
Tooling
» presto-cli, Python, R, BI tools (ODBC/JDBC), etc.
» Atlas/Suro for monitoring/logging
Why we love Presto?
» Open source
» Fast
» Scalable
» Works well on AWS
» Good integration with the Hadoop stack
» ANSI SQL
Our Contributions
24 open PRs, 60+ commits
» S3 file system
» multipart upload, IAM roles, retries, monitoring, etc.
» Functions for complex types
» Parquet
» name/index-based access, type coercion, etc.
» Query optimization
» Various other bug fixes
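The S3 file system work listed above includes retries. A common way to retry transient failures is exponential backoff, sketched below in Python. This is only an illustration of the general technique, not the code contributed to Presto (which is written in Java); the function name, the parameters, and the choice of `IOError` as the "transient" error are all assumptions made for this example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() and retry on IOError with exponentially growing waits.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay on every failed attempt, up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait between 50% and 100% of the computed delay so that
            # many clients do not retry in lockstep.
            sleep(delay * (0.5 + rng() / 2))
```

The `sleep` and `rng` parameters exist only so the waiting behavior can be replaced in tests; a caller would normally pass just the operation, for example a function that issues one S3 request.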
What are we Working on?
Parquet Optimizations
» Vectorized reader*: read based on column vectors
» Predicate pushdown: use statistics to skip data
» Lazy load: postpone loading the data until needed
» Lazy materialization: postpone decoding the data until needed
* PARQUET-
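Predicate pushdown and lazy loading can be illustrated with a small Python sketch. It assumes a simplified model in which each row group carries min/max statistics for one integer column and a callable that reads the values on demand; `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names for this example and do not correspond to the Presto or Parquet reader APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RowGroup:
    """A chunk of one integer column plus the statistics kept in its metadata."""
    min_value: int
    max_value: int
    load: Callable[[], List[int]]  # reads the values only when called


def make_row_group(values, load_log):
    """Build a RowGroup whose load() records each read in load_log."""
    def load():
        load_log.append(values)
        return values
    return RowGroup(min(values), max(values), load)


def scan_greater_than(row_groups, threshold):
    """Return values > threshold, reading only row groups that can match."""
    result = []
    for group in row_groups:
        if group.max_value <= threshold:
            # Predicate pushdown: the statistics prove no value can match,
            # so the row group is skipped without reading it.
            continue
        values = group.load()  # lazy load: data is read only at this point
        result.extend(v for v in values if v > threshold)
    return result
```

With three row groups holding `[1, 2, 3]`, `[4, 9, 6]`, and `[10, 11]`, a scan for values greater than 5 reads only the last two groups; the first is skipped on its statistics alone.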
Netflix Integration
» BI tools integration
» ODBC driver, Tableau web connector, etc.
» Better monitoring
» Ganglia ⟶ Atlas
» Data lineage
» Presto ⟶ Suro ⟶ Charlotte
What else we need?
» Graceful cluster shrink
» Better resource management
» Dynamic type coercion for all file formats
» Support for more Hive types (e.g., decimal)
» Predictable metastore cache behavior
» Big table joins similar to Hive
THANK YOU






Presto Meetup Talk @ FB (03/19/15)


Editor's Notes

  • #4: Data comes from apps/services. Event data: 200B events, including app logs, user activity (search events, movie detail clicks from the website, etc.), and system operational data. Ursula demultiplexes the events into event types (~150 event types right now). Latency of the Ursula pipeline is 15m. Dimension data: subscriber data. Aegisthus extracts data from Cassandra, which is the online backing store for Netflix, and writes it to S3.
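Demultiplexing a mixed event stream into per-type groups can be sketched in a few lines of Python. The event shape (a dict with a `"type"` key) and the function name are assumptions made for this example; the notes do not describe how Ursula represents events internally.

```python
from collections import defaultdict


def demultiplex(events):
    """Group a mixed stream of events by their event type.

    Assumes each event is a dict with a "type" key (a hypothetical shape).
    """
    by_type = defaultdict(list)
    for event in events:
        by_type[event["type"]].append(event)
    return dict(by_type)
```

Each resulting group can then be written to its own location, so that downstream jobs read only the event types they need.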
  • #5: Mention that we have a single DW on S3 and spin up multiple clusters. Little perf difference on S3 vs. HDFS, as we are mostly CPU bound. http://netflix.github.io/ Sting: reporting. Charlotte: lineage.
  • #6: Impala: no S3 support. Spark: loads all data, doesn't stream, and had stability issues at that time; it couldn't even handle an hour's worth of data (~2013). Spark SQL recently graduated from alpha with the Spark 1.3 release (https://spark.apache.org/releases/spark-release-1-3-0.html). Redshift: need to copy data from S3 to Redshift.
  • #7: r3.4xlarge and m2.4xlarge are both memory-optimized instances; m2 is a previous-generation instance type. 5PB of our 10PB Hive DW is in Parquet format.
  • #8: Single warehouse on S3; we spin up multiple test/prod Presto clusters and query live data, etc.
  • #9: S3 FS: exponential backoff, exposed various configs for the AWS SDK, multipart upload, IAM roles, and monitoring of prestoS3FileSystem and the AWS SDK. Parquet: better tooling/community support; good integration with existing tools (Hive, Spark, etc.). Several bug fixes and new functions to manipulate complex types, to close the gap between Hive and Presto. DDL: alter/create table. Optimization: (2085) Rewrite Single Distinct Aggregation into GroupBy and (1937) Optimize joins with similar subqueries. Complex types: array: contains, concat, sort; map: map_agg and map constructors, map_keys, map_values, etc.
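The idea behind rewriting a single distinct aggregation into a GROUP BY can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Presto, and the `plays` table with its columns is hypothetical. It only demonstrates that the two query shapes return the same result; it is not the optimizer rule itself.

```python
import sqlite3

# In-memory toy table; SQLite stands in for Presto here (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (country TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("US", "A"), ("US", "A"), ("US", "B"), ("BR", "A"), ("BR", None)],
)

# Original shape: one DISTINCT aggregation per group.
ORIGINAL = """
SELECT country, COUNT(DISTINCT title)
FROM plays
GROUP BY country
ORDER BY country
"""

# Rewritten shape: deduplicate with an inner GROUP BY, then run a plain COUNT.
REWRITTEN = """
SELECT country, COUNT(title)
FROM (SELECT country, title FROM plays GROUP BY country, title) AS deduped
GROUP BY country
ORDER BY country
"""


def run(query):
    """Execute a query against the toy table and return all rows."""
    return conn.execute(query).fetchall()
```

Calling `run(ORIGINAL)` and `run(REWRITTEN)` yields the same rows. Both forms ignore NULL titles, since `COUNT(title)` skips NULLs just as `COUNT(DISTINCT title)` does.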
  • #11: We log queries to our internal data pipeline (Suro), and another internal tool (Charlotte) analyzes data lineage.
  • #12: We are pushing reporting to Presto with our Tableau/MS work, not ETL → monitoring and scheduling improvements. Presto's distributed join is still memory-limited, as there are no spills. Hive decimal type: https://github.com/facebook/presto/issues/2417 → at least be able to read it; still open.