Presto @ Netflix: Interactive Queries at Petabyte Scale
Nezih Yigitbasi and Zhenxiao Luo
Big Data Platform
Outline
» Big data platform @ Netflix
» Why we love Presto?
» Our contributions
» What are we working on?
» What else we need?
Our Data Pipeline
» Event data (15m): Cloud apps ⟶ Suro ⟶ Ursula ⟶ S3
» Dimension data (daily): Cassandra ⟶ SSTables ⟶ Aegisthus ⟶ S3
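The speaker notes for this slide say that Ursula demultiplexes the incoming event stream into event types (~150 of them). The sketch below illustrates that idea only; it is not Netflix's implementation, and the `type` field name and the sample events are assumptions made up for the example.

```python
from collections import defaultdict


def demultiplex(events, type_field="type"):
    """Group a mixed stream of events by event type.

    `type_field` is a hypothetical field name chosen for this sketch.
    """
    by_type = defaultdict(list)
    for event in events:
        # Events without a type go to a catch-all bucket instead of being dropped.
        by_type[event.get(type_field, "unknown")].append(event)
    return dict(by_type)


# Made-up sample events, loosely modeled on the user-activity examples in the notes.
events = [
    {"type": "search", "query": "house of cards"},
    {"type": "movie_detail_click", "movie_id": 42},
    {"type": "search", "query": "daredevil"},
]
buckets = demultiplex(events)
```

Each bucket could then be written to its own location, so that downstream jobs read only the event types they need.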
Big Data Platform Architecture
[Architecture diagram; labels: Data Warehouse, Service, Tools, Gateways, Prod, Clients, Clusters (VPC, Query, Prod, Test, Bonus, Prod)]
Our Use Cases
» Batch jobs (Pig, Hive)
  » ETL jobs
  » Reporting and other analysis
» Ad-hoc queries
  » Interactive data exploration
  » Looked at Impala, Redshift, Spark, and Presto
Presto @ Netflix
Deployment
» v 0.86
» 1 coordinator (r3.4xlarge)
» 250 workers (m2.4xlarge)
Numbers
» ~2.5K queries/day against our 10PB Hive DW on S3
» 230+ Presto users out of 300+ platform users
Tooling
» presto-cli, Python, R, BI tools (ODBC/JDBC), etc.
» Atlas/Suro for monitoring/logging
Why we love Presto?
» Open source
» Fast
» Scalable
» Works well on AWS
» Good integration with the Hadoop stack
» ANSI SQL
Our Contributions
24 open PRs, 60+ commits
» S3 file system
  » Multipart upload, IAM roles, retries, monitoring, etc.
» Functions for complex types
» Parquet
  » Name/index-based access, type coercion, etc.
» Query optimization
» Various other bug fixes
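The speaker notes mention exponential backoff as part of the S3 file system retry work. The following is a generic Python sketch of retrying with exponential backoff and jitter. It is not the PrestoS3FileSystem code; `retry_with_backoff`, its parameters, and the simulated `flaky_read` are all hypothetical names for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries from many workers hitting the store at once.
            sleep(delay * random.uniform(0.5, 1.0))


# Simulated transient failure: the first two calls fail, the third succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated transient S3 error")
    return b"data"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_read, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch easy to test; a real caller would leave the default `time.sleep`.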
What are we Working on?
Parquet Optimizations
» Vectorized reader*: read based on column vectors
» Predicate pushdown: use statistics to skip data
» Lazy load: postpone loading the data until needed
» Lazy materialization: postpone decoding the data until needed
* PARQUET-
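To make "use statistics to skip data" concrete, here is a small Python sketch of predicate pushdown over row groups that carry min/max statistics. It is a toy model, not the Parquet reader API; `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names, and the predicate is fixed to `value > threshold` for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    min_value: int      # statistics recorded when the data was written
    max_value: int
    values: List[int]   # stands in for the encoded column data


def make_row_group(values):
    return RowGroup(min(values), max(values), list(values))


def scan_greater_than(row_groups, threshold):
    """Return values > threshold and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match: skip without decoding
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([1, 5, 9]),
    make_row_group([10, 12, 20]),
    make_row_group([3, 4, 30]),
]
matches, groups_read = scan_greater_than(row_groups, 9)
```

Here the first row group is skipped entirely because its maximum is 9, so only two of the three groups are read.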
Netflix Integration
» BI tools integration
  » ODBC driver, Tableau web connector, etc.
» Better monitoring
  » Ganglia ⟶ Atlas
» Data lineage
  » Presto ⟶ Suro ⟶ Charlotte
What else we need?
» Graceful cluster shrink
» Better resource management
» Dynamic type coercion for all file formats
» Support for more Hive types (e.g., decimal)
» Predictable metastore cache behavior
» Big table joins similar to Hive
THANK YOU

Presto@Netflix Presto Meetup 03-19-15

Editor's Notes

  • #4: Data from apps/services. Event data: 200B events, including app logs, user activity (search events, movie detail clicks from the website, etc.), and system operational data. Ursula demultiplexes the events into event types (~150 event types right now); the latency of this Ursula pipeline is 15m. Dimension data: subscriber data. Aegisthus extracts data from Cassandra, which is the online backing store for Netflix, and writes it to S3.
  • #5: Mention that we have a single DW on S3 and spin up multiple clusters. Little perf difference on S3 vs. HDFS, as we are mostly CPU bound. http://netflix.github.io/ Sting: reporting. Charlotte: lineage.
  • #6: Impala: no S3 support. Spark: loads all data, doesn't stream, and had stability issues at that time; it couldn't even handle an hour's worth of data (~2013). Spark SQL recently graduated from alpha with the Spark 1.3 release (https://spark.apache.org/releases/spark-release-1-3-0.html). Redshift: need to copy data from S3 to Redshift.
  • #7: r3.4xlarge and m2.4xlarge are both memory-optimized instances; m2 is a previous-generation instance type. 5PB of our 10PB Hive DW is in Parquet format.
  • #8: Single warehouse on S3; spin up multiple test/prod Presto clusters and query live data, etc.
  • #9: S3 FS: exponential backoff, exposed various configs for the AWS SDK, multipart upload, IAM roles, and monitoring of PrestoS3FileSystem and the AWS SDK. Parquet: better tooling/community support, and good integration with existing tools (Hive, Spark, etc.). Several bug fixes and new functions to manipulate complex types, to close the gap between Hive and Presto. DDL: alter/create table. Optimization: (2085) Rewrite Single Distinct Aggregation into GroupBy, and (1937) Optimize joins with similar subqueries. Complex types: array: contains, concat, sort; map: map_agg and map constructors, map_keys, map_values, etc.
  • #11: We log queries to our internal data pipeline (Suro) and another internal tool (Charlotte) analyzes data lineage
  • #12: We are pushing reporting to Presto with our Tableau/MS work, not ETL. → monitoring, scheduling improvements. Presto's distributed join is still memory-limited, as there is no spilling. Hive decimal type: https://github.com/facebook/presto/issues/2417 -> at least be able to read it; still open.
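
The notes above mention an optimization that rewrites a single distinct aggregation into a GROUP BY. The Python sketch below uses the standard `sqlite3` module only to show that the two query shapes return the same answer. It does not reproduce Presto's planner, and the `events` table and `user_id` column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)

# Original form: a single DISTINCT aggregation.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
).fetchone()[0]

# Rewritten form: de-duplicate with GROUP BY first, then apply a plain aggregation.
grouped_count = conn.execute(
    "SELECT COUNT(user_id) FROM (SELECT user_id FROM events GROUP BY user_id)"
).fetchone()[0]

conn.close()
```

One plausible motivation for such a rewrite is that the grouped form can be executed like any other distributed GROUP BY, followed by a simple count; that reading is an assumption, not something the notes state.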