1
Presto - SQL on anything
January 2017
Grzegorz Kokosiński
Karol Sobczak
Teradata Center for Hadoop
2
Agenda
- Who are we?
- What is Presto?
- What is data federation?
- Different federation strategies in other databases (HIVE)
- what is supported and what are the problems
- Presto Connector
- Show time
3
Let's make some noise
• Let's tweet about this presentation!
– #whug
– #prestodb
– #teradata
• Later on we will query that data!
4
Who are we
5
What is Presto?
• 100% open source distributed SQL query engine
- Originally developed by Facebook
• Key Differentiators:
- Performance & Scale
- Cross platform query capability, not only SQL on Hadoop
• Apache licensed, hosted on GitHub
- Certified distro & support from Teradata
6
Presto Users
See more at https://github.com/prestodb/presto/wiki/Presto-Users
7
Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
- 300PB in HDFS, sharded MySQL, SSD-based Raptor
– 1000s of internal daily active users
– 10s-100s of concurrent queries
• Netflix
– 250+ nodes on EC2, 40+ PB in S3 (Parquet format)
– Over 650 active users and 6K+ queries daily
• Twitter
– 200+ nodes on-premises over Parquet nested data
• Uber
– 200+ nodes (2 dedicated clusters) with 25K+ & 3K+ queries daily
• FINRA
– 120+ nodes in AWS, 2PB in S3, 200+ users (supported by Teradata)
8
Presto – Query Execution Performance
• In-memory processing
• Pipelined execution across nodes (MPP-style)
– Vectorized columnar processing
– Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient memory management (reduced GC overhead)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
– Enables the use of BI tools
9
Supported data sources & file formats
• Hadoop/Hive connector & file formats (HDFS/S3):
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Raptor
– columnar store on flash driven by Facebook
• Open source data stores (driven by the community)
– MySQL & PostgreSQL (non-parallel)
– Cassandra (by Teradata)
– Kafka
– Redis
– MongoDB
– ElasticSearch
– Accumulo (by Bloomberg)
10
ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Windowing functions
• UNNEST, TABLESAMPLE
• ROLLUP, CUBE, GROUPING SETS
• UNION, EXCEPT, INTERSECT
• Subqueries (EXISTS, IN)
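As a rough illustration of the SELECT clauses on this slide, the sketch below runs one query that combines WITH, JOIN, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT. Python's built-in sqlite3 module is used only as a stand-in engine so that the example runs anywhere; it is not Presto. The tables `users` and `tweets` and the helper names are made up for illustration, and Presto-specific features such as UNNEST or GROUPING SETS are not shown.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory data set (hypothetical tables, not from the talk)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER, name TEXT, active INTEGER);
        CREATE TABLE tweets (user_id INTEGER, hashtag TEXT);
        INSERT INTO users VALUES (1, 'ann', 1), (2, 'bob', 1), (3, 'cid', 0);
        INSERT INTO tweets VALUES
            (1, '#prestodb'), (1, '#prestodb'), (1, '#whug'),
            (2, '#prestodb'), (2, '#prestodb'), (2, '#prestodb'),
            (3, '#prestodb'), (3, '#prestodb');
    """)
    return conn


def top_taggers(conn, min_tweets=2, limit=2):
    """Return (name, tweet count) for active users with enough #prestodb tweets."""
    query = """
        WITH tagged AS (
            SELECT user_id FROM tweets WHERE hashtag = '#prestodb'
        )
        SELECT u.name, COUNT(*) AS n
        FROM users u
        INNER JOIN tagged t ON (u.id = t.user_id)
        WHERE u.active = 1
        GROUP BY u.name
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, u.name ASC
        LIMIT ?
    """
    return conn.execute(query, (min_tweets, limit)).fetchall()
```

Calling `top_taggers(demo_connection())` returns the active users ordered by their number of `#prestodb` tweets; the inactive user is dropped by the WHERE clause before grouping.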
11
Presto is not a database!
• Presto is a query execution engine (storage independent)
• Pluggable custom user functionalities
– Connectors
– Functions
– Types
– System access controllers
– Resource group configuration managers
– Event listeners
– …
• Built-in core functionalities:
– parser, execution, types, sql functions, monitoring
12
Data federation
• Query data from several data sources (databases)
• Streaming
– One to One
- there is a single connection between database access points
- e.g. PSQL via PSQL
- using storage handlers to access RDBMS data from Hive
– Many to One
- many connections from the nodes of one database to a single access point of
another database
- Accessing REST from UDF in (possibly each) HIVE map/reduce task
– Many to Many
- workers talk to each other directly
• Through storage
– Needs (intermediate) data materialization
• Presto supports them all!
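A minimal sketch of the streaming, one-to-one style of federation: an engine reads from two independent data sources over one connection each and joins the rows itself. Two separate in-memory SQLite databases stand in for the sources here; the tables `orders` and `customers` and the function names are assumptions made for this example, not part of Presto.

```python
import sqlite3


def make_source(script):
    """Create an isolated in-memory database acting as one data source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn


def federated_join(orders_src, customers_src):
    """Hash-join rows read from two separate sources inside the 'engine'."""
    # Build side: load the (assumed smaller) table into a lookup map.
    names = dict(customers_src.execute("SELECT id, name FROM customers"))
    # Probe side: stream rows from the other source one at a time.
    result = []
    for customer_id, amount in orders_src.execute(
            "SELECT customer_id, amount FROM orders ORDER BY amount"):
        if customer_id in names:
            result.append((names[customer_id], amount))
    return result
```

Neither source knows about the other; only the engine sees both, which is the essence of the federation strategies listed above. The "through storage" variant would instead first copy one table into the other system and then join there.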
13
Data federation common problems
• model incompatibilities
• multinode streaming is not always possible
• transactions
• cost based optimizations (statistics)
• SQL pushdown (predicates, projections, aggregations?, joins?)
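To make the pushdown problem concrete, the sketch below compares how many rows leave the data source when a predicate is evaluated in the engine versus when it is pushed into the source's own SQL. SQLite stands in for the remote source; the `events` table and all function names are hypothetical.

```python
import sqlite3


def demo_source():
    """A small stand-in data source with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (id INTEGER, country TEXT);
        INSERT INTO events VALUES (1, 'PL'), (2, 'US'), (3, 'PL'), (4, 'DE');
    """)
    return conn


def scan(conn, pushed_predicate=None):
    """Read from the source, optionally pushing a predicate into its SQL."""
    sql = "SELECT id, country FROM events"
    if pushed_predicate:
        sql += " WHERE " + pushed_predicate  # filtering happens at the source
    return conn.execute(sql + " ORDER BY id").fetchall()


def without_pushdown(conn):
    """Transfer every row, then filter inside the engine."""
    transferred = scan(conn)
    result = [row for row in transferred if row[1] == "PL"]
    return len(transferred), result


def with_pushdown(conn):
    """Let the source filter, so only matching rows are transferred."""
    transferred = scan(conn, "country = 'PL'")
    return len(transferred), transferred
```

Both functions return the same result rows, but the first transfers the whole table while the second transfers only the matches. Pushing down aggregations or joins follows the same idea, but depends far more on what the source can express.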
14
Connector
• Presto interface to access arbitrary data source (hive, mysql, jmx)
• Provides:
– metadata
– ability to do distributed, parallel and streamed reads/writes
– transaction boundary
– physical data layouts
– statistics
– (SQL) predicate pushdown
– indexes (index join)
– session or table properties
– access control
– procedures (CALL …)
– . . .
• Most (if not all) of the above points are optional
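The shape of such an interface can be sketched as follows. This is a deliberately simplified, hypothetical contract written in Python for brevity; the real Presto connector SPI is a Java API with different names and many more methods. It shows three mandatory pieces (metadata, splits for parallel reads, streamed reading) and one optional capability with a default.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical, simplified connector contract (not the real Presto SPI)."""

    @abstractmethod
    def list_columns(self, table):
        """Metadata: column names of a table."""

    @abstractmethod
    def get_splits(self, table):
        """Describe independent chunks that can be read in parallel."""

    @abstractmethod
    def read_split(self, table, split):
        """Stream the rows of one split."""

    def apply_predicate(self, table, predicate):
        """Optional capability: by default nothing is pushed down."""
        return False


class InMemoryConnector(Connector):
    """Toy connector over Python lists, for illustration only."""

    def __init__(self, tables, split_size=2):
        self._tables = tables  # {table name: (column names, rows)}
        self._split_size = split_size

    def list_columns(self, table):
        return list(self._tables[table][0])

    def get_splits(self, table):
        n = len(self._tables[table][1])
        return [(start, min(start + self._split_size, n))
                for start in range(0, n, self._split_size)]

    def read_split(self, table, split):
        start, end = split
        yield from self._tables[table][1][start:end]
```

Because the optional methods have defaults, a minimal connector only has to answer "what columns exist", "how can the data be divided" and "give me the rows of this piece".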
15
Presto Architecture
[Diagram: Client, Coordinator (Parser/analyzer, Planner, Scheduler) and three Workers. Pluggable parts: Metadata API, Data location API, and Data stream API (on the Workers).]
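The interplay of the components named on this slide can be sketched roughly as follows: a coordinator asks where the data lives, schedules one task per chunk on a pool of workers, and combines their partial results. The functions below are stand-ins invented for this example (a thread pool plays the role of the workers); they are not Presto APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def data_location_api(rows, split_size):
    """Stand-in for the data location API: describe how the data is divided."""
    return [(i, min(i + split_size, len(rows)))
            for i in range(0, len(rows), split_size)]


def worker_task(rows, split):
    """Stand-in for a worker: read one split and compute a partial aggregate."""
    start, end = split
    return sum(rows[start:end])


def coordinator_sum(rows, workers=3, split_size=4):
    """Plan a SUM, schedule one task per split, then combine partial results."""
    splits = data_location_api(rows, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda s: worker_task(rows, s), splits))
    return sum(partials)
```

In this toy version the "planner" is hard-coded to a single SUM; the point is only the division of labour between locating data, scheduling work, and reading it in parallel.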
16
Data federation with Presto
• Through the storage
• Demo
– HIVE
[Diagram: Presto coordinator and two Presto workers; Hive Metastore, HDFS Namenode and two HDFS DataNodes. Arrows labelled "metadata" (twice) and "data transfer" (twice).]
17
Data federation with Presto
• One to One
• Demo
– psql
– REST
– and above with HIVE
[Diagram: Presto coordinator and two Presto workers connected to a SQL Database. Arrows labelled "JDBC metadata" and "JDBC data".]
18
Many to many - data federation with Presto
[Diagram: TERADATA side (PE, four AMPs) and PRESTO side (Coordinator, four Worker Threads), joined by two "QG Exchange" blocks. Arrows labelled "Init & metadata exchange" and "Bi-directional fully parallel data exchange".]
• Key features:
• Low latency
• High performance
• Concurrency
• SQL pushdown
• Data conversion
• Compression
• Efficient CPU usage
19
Conclusion
• Presto Connector is expressive
• A 3rd-party data source is a 1st-class citizen
• Single ANSI SQL to rule them all
– use BI tools on data which is not BI friendly
• Rapid data integration
20
More information
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:
www.github.com/prestodb/presto
www.github.com/Teradata/presto
21
www.teradata.com/presto
