Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, Inc.
Prestogres internals
PostgreSQL protocol gateway for Presto
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> ServerEngine - A Ruby framework to build multiprocess servers
> Prestogres - PostgreSQL protocol gateway for Presto
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed, strongly consistent key-value data store
Today’s talk
1. What’s Presto?
2. Prestogres design
3. Prestogres implementation
4. Prestogres hacks
5. Prestogres future works
1. What’s Presto?
What’s Presto?
A distributed SQL query engine

for interactive data analysis

against GBs to PBs of data.
What are the problems to solve?
> We couldn’t visualize data in HDFS directly using
dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results to an
interactive DB for quick response

(PostgreSQL, Redshift, etc.)
> An interactive DB costs more and is far less scalable
> Some data are not stored in HDFS
> We need to copy the data into HDFS to analyze
Batch analysis platform: HDFS + Hive (daily/hourly batch)
Visualization platform: PostgreSQL, etc. (interactive query) + dashboard + commercial BI tools
Problems with the batch analysis platform + visualization platform setup:
✓ Less scalable
✓ Extra cost
✓ Extra work to manage 2 platforms
✓ Can’t query against “live” data directly
Data analysis platform: HDFS + Hive (daily/hourly batch) + Presto (interactive query) + dashboard
Presto takes the place of PostgreSQL, etc. for interactive queries
Data analysis platform: Presto (interactive query) over HDFS, Hive (daily/hourly batch), Cassandra, PostgreSQL, and commercial DBs
SQL on any data sets
Commercial BI tools connect through Prestogres:
✓ IBM Cognos
✓ Tableau
✓ ...
Presto
HDFS
Dashboard
Interactive query
Commercial

BI Tools
✓ IBM Cognos

✓ Tableau

✓ ...
Prestogres
Today’s topic!
dashboard on chart.io: https://chartio.com/
What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity through Prestogres
> Query across multiple data sources such as

Hive, HBase, Cassandra, or even internal DBs
> Plugin mechanism
> Integrate batch analysis + visualization

into a single data analysis platform
Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole
> Presto as a Service
Prestogres design of the ODBC/JDBC gateway
The problems with using Presto from BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...
> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time
• psqlODBC: 58,000 lines
• postgresql-jdbc: 62,000 lines
• mysql-connector-odbc: 27,000 lines
• mysql-connector-j: 101,000 lines
A solution
> Create a PostgreSQL protocol gateway
> Use PostgreSQL’s stable ODBC / JDBC drivers
Other possible designs were…
a) MySQL protocol + libdrizzle:
> Drizzle provides a well-designed library to implement
MySQL protocol server.
> Proof-of-concept worked well:
• trd-gateway - MySQL protocol gateway for Hive
> Difficulties: clients assume the server is MySQL, but:
• syntax mismatches: MySQL quotes identifiers with `…` while Presto uses "…"
• function mismatches: DAYOFMONTH(…) vs EXTRACT(day…)
b) PostgreSQL + Foreign Data Wrapper (FDW):
> JOIN and aggregation pushdown are not available
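The identifier-quoting mismatch listed under option a) can be illustrated with a small sketch. It is not part of trd-gateway or Prestogres; the function name is hypothetical, and it naively assumes that no backticks appear inside string literals or comments.

```python
import re


def backticks_to_double_quotes(sql: str) -> str:
    """Naively turn MySQL-style `identifier` quoting into "identifier"."""
    # Assumption: backticks only ever delimit identifiers in the input.
    return re.sub(r"`([^`]*)`", r'"\1"', sql)


print(backticks_to_double_quotes("select `user id` from `access`"))
# select "user id" from "access"
```

Quoting is the easy half. Function mismatches such as DAYOFMONTH(…) vs EXTRACT(day…) need real SQL rewriting, which is part of why this design was not chosen.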
Other possible designs were…
c) PostgreSQL + H2 database + patch:
> H2 is an embedded database engine written in Java
> H2 has a PostgreSQL protocol implementation in Java
> Difficulties:
• System catalog implementation is incomplete

(pg_class, pg_namespace, pg_proc, etc.)
d) Reusing PostgreSQL protocol impl.:
> Difficulties:
• complete implementation of system catalogs was too difficult
Prestogres design
pgpool-II + PostgreSQL + PL/Python
> pgpool-II is a PostgreSQL protocol middleware for
replication, failover, load-balancing, etc.
> pgpool-II already includes some useful code

(parsing SQL, rewriting SQL, hacking system catalogs, …)
> Basic idea:
• Rewrite queries at pgpool-II and run Presto queries using PL/Python
select count(1)
from access
↓ rewrite!
select * from
python_func('select count(1) from access')
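The rewrite step can be sketched in a few lines of Python. This is only an illustration of the idea, not the pgpool-II patch itself (which is written in C); `rewrite_for_presto` is a hypothetical name, `python_func` is the placeholder used on the slide, and the quoting assumes PostgreSQL's standard_conforming_strings behavior.

```python
def rewrite_for_presto(sql: str, func_name: str = "python_func") -> str:
    """Wrap a client query in a call to a set-returning function."""
    # Drop a trailing semicolon and double the single quotes so that the
    # original query can be embedded in a SQL string literal.
    body = sql.strip().rstrip(";").replace("'", "''")
    return f"select * from {func_name}('{body}')"


print(rewrite_for_presto("select count(1) from access"))
# select * from python_func('select count(1) from access')
```

The client still sees an ordinary PostgreSQL result set, because the wrapped query is executed by PostgreSQL and only the function body talks to Presto.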
Prestogres implementation
Overview
psql / odbc / jdbc → pgpool-II (patched!) → PostgreSQL → Presto
1. Authentication
2. Create faked system catalogs for meta-queries
3. Rewriting queries
4. Executing queries using PL/Python
Prestogres: patched pgpool-II + PostgreSQL + PL/Python
Authentication
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
1. psql sends a StartupPacket to pgpool-II:
StartupPacket {
database = "mydb",
user = "me",
…
}
2. pgpool-II finds the matching entry in prestogres_hba.conf:
host mydb me 0.0.0.0/0 trust
presto_server presto.local:8080,
presto_catalog hive,
pg_database hive
3. pgpool-II connects to PostgreSQL as the system_db user set in prestogres.conf:
system_db_dbname = 'postgres'
system_db_user = 'prestogres'
libpq host='localhost', dbname='postgres', user='prestogres' (system_db)
4. Over that connection it prepares the database, the role, and the functions:
> CREATE DATABASE hive;
> CREATE ROLE me;
> CREATE FUNCTION setup_system_catalog;
> CREATE FUNCTION start_presto_query;
5. pgpool-II forwards the StartupPacket to PostgreSQL with the database replaced by pg_database:
StartupPacket {
database = "hive",
user = "me",
…
}
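The lookup and the database substitution can be sketched as follows. This is a minimal illustration under stated assumptions, not Prestogres code: `HbaEntry`, `match_entry`, and `rewrite_startup` are hypothetical names, the packet is modeled as a plain dict, and matching is exact (no wildcards, no password check).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass
class HbaEntry:
    """One prestogres_hba.conf line, modeled after the example on the slide."""
    database: str
    user: str
    network: str
    presto_server: str
    presto_catalog: str
    pg_database: str


def match_entry(entries: List[HbaEntry], database: str, user: str,
                client_ip: str) -> Optional[HbaEntry]:
    """Return the first entry matching database, user, and client address."""
    for entry in entries:
        if (entry.database == database and entry.user == user
                and ip_address(client_ip) in ip_network(entry.network)):
            return entry
    return None


def rewrite_startup(packet: dict, entry: HbaEntry) -> dict:
    """Copy the startup parameters, swapping in the PostgreSQL-side database."""
    return {**packet, "database": entry.pg_database}


entries = [HbaEntry("mydb", "me", "0.0.0.0/0", "presto.local:8080", "hive", "hive")]
packet = {"database": "mydb", "user": "me"}
entry = match_entry(entries, packet["database"], packet["user"], "192.0.2.10")
print(rewrite_startup(packet, entry))
# {'database': 'hive', 'user': 'me'}
```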
Meta-query: a query against a system catalog
$ psql -U me mydb
1. psql sends “Q” SELECT * FROM pg_class;
2. pgpool-II notices: “Query against a system catalog!”
3. pgpool-II first runs a PL/Python function defined at prestogres.py on PostgreSQL:
SELECT setup_system_catalog('presto.local:8080', 'hive')
4. setup_system_catalog reads the table definitions from Presto
(SELECT * FROM information_schema.columns) and creates them in PostgreSQL:
> CREATE TABLE access_logs;
> CREATE TABLE users;
> CREATE TABLE events;
…
5. pgpool-II then forwards the query to PostgreSQL as-is:
“Q” SELECT * FROM pg_class;
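The decision between a meta-query and a Presto query could be sketched like this. The real gateway relies on pgpool-II's SQL parser; the regular expression below is a deliberately naive stand-in, the list of catalog names is incomplete, and `is_meta_query` and `route` are hypothetical names.

```python
import re

# Assumption: a short, incomplete list of catalog relations for illustration.
SYSTEM_RELATIONS = re.compile(
    r"\b(?:pg_catalog\.\w+|pg_class|pg_namespace|pg_proc)\b",
    re.IGNORECASE,
)


def is_meta_query(sql: str) -> bool:
    """True if the query text mentions a known system catalog relation."""
    return SYSTEM_RELATIONS.search(sql) is not None


def route(sql: str) -> str:
    """Meta-queries stay on PostgreSQL; everything else goes to Presto."""
    return "postgresql" if is_meta_query(sql) else "presto"


print(route("SELECT * FROM pg_class;"))
print(route("select count(*) from access_logs;"))
```

A text match like this misclassifies queries that merely mention a catalog name inside a string literal, which is one reason to work on the parsed statement instead.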
Presto query: a query against a regular table
$ psql -U me mydb
1. psql sends “Q” select count(*) from access_logs;
2. pgpool-II notices: “Query against a regular table!”
3. pgpool-II first runs a PL/Python function defined at prestogres.py on PostgreSQL:
SELECT start_presto_query(…
'select count(*) from access_logs')
4. start_presto_query starts the query on Presto and defines a function to fetch the result:
> CREATE TYPE result_type (c0_ BIGINT);
> CREATE FUNCTION fetch_results
RETURNS SETOF result_type
…
5. pgpool-II rewrites the query to call the PL/Python function defined by start_presto_query:
“Q” SELECT * FROM fetch_results();
or, to report an error to the client:
“Q” RAISE EXCEPTION …
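How the result type might be derived from the Presto column list is sketched below. This is an illustration only: the type mapping is a small hypothetical table, the function names are made up, and the sketch emits `CREATE TYPE … AS (…)`, the form PostgreSQL accepts, whereas the slide abbreviates the statement.

```python
# Assumption: a tiny, hypothetical Presto-to-PostgreSQL type mapping.
PRESTO_TO_PG = {
    "bigint": "BIGINT",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "varchar": "TEXT",
}


def result_type_ddl(columns, type_name="result_type"):
    """Build a CREATE TYPE statement from (column name, Presto type) pairs."""
    cols = ", ".join(f"{name} {PRESTO_TO_PG[ptype]}" for name, ptype in columns)
    return f"CREATE TYPE {type_name} AS ({cols});"


def fetch_query(func_name="fetch_results"):
    """The statement that replaces the client's original query."""
    return f"SELECT * FROM {func_name}();"


print(result_type_ddl([("c0_", "bigint")]))
# CREATE TYPE result_type AS (c0_ BIGINT);
print(fetch_query())
# SELECT * FROM fetch_results();
```

Defining the type per query is what lets PostgreSQL describe the result columns to the ODBC/JDBC driver before any rows arrive.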
Prestogres hacks
Multi-statement queries
BEGIN; SELECT 1; COMMIT;
Supporting Cursors
DECLARE xyz CURSOR FOR select …; FETCH
Security
security definer
Error message handling
raise exception '%', E'…' using errcode = …;
Faked current_database()
delete from pg_catalog.pg_proc where
proname='current_database';
create function pg_catalog.current_database()
returns name as $$
begin
return 'faked_name'::name;
end
$$ language plpgsql stable strict;
Future works
Future works
Rewriting CAST syntax
Extended query
CREATE TEMP TABLE
DROP TABLE
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!