© 2ndQuadrant 2016
Big Data & PostgreSQL
Using TABLESAMPLE to Analyze
Very Large Datasets
By Umair Shahid
Who am I?
● Got “pushed” into PostgreSQL in 2004, ended
up falling in love with it
● Not a hardcore techie, yet passionate about
open source software
● Heading the productization efforts at
2ndQuadrant
● Interested in Big Data, specifically the newer
PostgreSQL features supporting it
What is the problem?
Number of Rows   Size on Disk (MB)   Time Taken (ms)
1k                            0.23           219.706
100k                            24         1,302.135
1M                             195         7,696.386
5M                             951        40,691.603
10M                          1,923        60,012.457
100M                        19,456       801,493.319
Why is this significant?
● Data mining has typically been a painful process
● Major contributor to the pain has been the time it
takes for queries to return
● Many false steps before the required data is
identified
● Waiting time is wasted time
● Sampling, count based or time based, reduces
the wasted time significantly
What is TABLESAMPLE?
● Ability to read a random sample of
data in a table
● Defined in SQL:2003 (5th revision of
SQL)
● Implemented in PostgreSQL 9.5
Syntax
SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] )
[ REPEATABLE ( seed ) ]
...
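As a concrete instance of the syntax above, here is a minimal sketch that reads roughly 1% of a table. The table `events` and its columns `id` and `amount` are hypothetical, chosen only for illustration. Note that sampling is applied before any WHERE clause.

```sql
-- Hypothetical table: events(id bigint, amount numeric)

-- Read approximately 1% of the table using the built-in SYSTEM method.
SELECT count(*)    AS sampled_rows,
       avg(amount) AS avg_amount
FROM events TABLESAMPLE SYSTEM (1);

-- Same query with a fixed seed, so the sample can be reproduced later.
SELECT count(*)    AS sampled_rows,
       avg(amount) AS avg_amount
FROM events TABLESAMPLE SYSTEM (1) REPEATABLE (42);
```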
sampling_method
● argument is the approximate percentage of the
table to sample (0 to 100)
● SYSTEM
○ Block level sampling
○ Very fast
○ Non-independent rows
● BERNOULLI
○ Row level sampling
○ Slower than SYSTEM
○ Independent rows (uniformly random)
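The trade-off between the two built-in methods can be observed with a sketch like the one below. It assumes the same hypothetical `events(id, amount)` table; actual timings and buffer counts depend on the data and hardware, so none are shown here.

```sql
-- SYSTEM: picks whole blocks (pages) at random and returns every row in them.
-- Fewer pages are read, but rows stored together are sampled together.
EXPLAIN (ANALYZE, BUFFERS)
SELECT avg(amount) FROM events TABLESAMPLE SYSTEM (1);

-- BERNOULLI: scans the whole table and keeps each row with 1% probability.
-- Slower, but each row is selected independently of its neighbours.
EXPLAIN (ANALYZE, BUFFERS)
SELECT avg(amount) FROM events TABLESAMPLE BERNOULLI (1);

-- Rough estimate of the total row count, scaled up from a 1% sample.
SELECT count(*) * 100 AS estimated_rows
FROM events TABLESAMPLE BERNOULLI (1);
```

Comparing the buffer counts in the two plans is one way to see why SYSTEM is faster. If the data is physically clustered on disk, BERNOULLI is likely to give the more representative sample.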
Demo sampling methods
REPEATABLE results
● (Reminder: [ REPEATABLE ( seed ) ])
● Optional argument
● Used if random, yet repeatable results are
required
● seed and argument need to be the same to
produce repeatable results
● Any changes made to the table will result in a
different data set
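The points above can be sketched as follows, again using the hypothetical `events` table. The seed values 200 and 201 are arbitrary.

```sql
-- Same seed and same argument: both queries select the same sample,
-- provided the table has not been changed in between.
SELECT id FROM events TABLESAMPLE BERNOULLI (0.1) REPEATABLE (200);
SELECT id FROM events TABLESAMPLE BERNOULLI (0.1) REPEATABLE (200);

-- A different seed will usually select a different sample.
SELECT id FROM events TABLESAMPLE BERNOULLI (0.1) REPEATABLE (201);

-- Without REPEATABLE, a fresh random sample is drawn on every execution.
SELECT id FROM events TABLESAMPLE BERNOULLI (0.1);
```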
Now it gets interesting …
● TABLESAMPLE allows for additional sampling methods
via extensions
● tsm_system_time specifies max number of
milliseconds to spend reading a table
● Implements the syntax:
SELECT select_expression
FROM table_name
TABLESAMPLE SYSTEM_TIME (argument)
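A minimal sketch of the time-based method, assuming the contrib modules are installed on the server and using the hypothetical `events` table. The count-based companion extension, tsm_system_rows, is included for comparison.

```sql
-- tsm_system_time ships with PostgreSQL's contrib modules;
-- it needs to be enabled once per database.
CREATE EXTENSION IF NOT EXISTS tsm_system_time;

-- Spend at most about 1000 ms reading randomly chosen blocks.
-- Like SYSTEM, this samples at the block level.
SELECT count(*)    AS sampled_rows,
       avg(amount) AS avg_amount
FROM events TABLESAMPLE SYSTEM_TIME (1000);

-- Count-based alternative: tsm_system_rows caps the number of rows read.
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

SELECT * FROM events TABLESAMPLE SYSTEM_ROWS (1000);
```

Neither SYSTEM_TIME nor SYSTEM_ROWS supports the REPEATABLE clause, so these samples cannot be reproduced with a seed.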
Demo tsm_system_time
Enter Orange ...
● Funded by AXLE (http://axleproject.eu)
● Same project funded
TABLESAMPLE
● Available integrated
with PostgreSQL in
2UDA (http://2ndquadrant.com/2uda)
● Uses TABLESAMPLE
to very quickly create
visualizations for data
● Can quickly create
predictive models
Demo Orange
You can find a very helpful tutorial at
http://2ndquadrant.com/2uda
Other Big Data features in PostgreSQL
● JSON & JSONB
● HSTORE
● XML
● Scale-out by partitioning
○ Check out Postgres-XL (http://www.postgres-xl.org/)
● etc ...
Umair Shahid
Email: umair.shahid@2ndQuadrant.com
Twitter: @pg_umair
2ndQuadrant is hiring - All geographies!
Thank you for your time!

