pd.{read/to}_sql is simple but not fast
Uwe Korn – QuantCo – November 2020
About me
• Engineering at QuantCo

• Apache {Arrow, Parquet} PMC

• Turbodbc Maintainer

• Other OSS stuff
@xhochy
@xhochy
mail@uwekorn.com
https://uwekorn.com
Our setting
• We like tabular data

• Thus we use pandas

• We want large amounts of this data in pandas
• The traditional storage for it is SQL databases

• How do we get from one to another?
SQL
• Very very brief intro:

• „domain-specific language for accessing data held in a relational
database management system“

• The one language in data systems that predates all the Python, R,
Julia, … we use as our „main“ languages; it also has a much wider user
base

• SELECT * FROM table

INSERT INTO table
• Two main arguments (read_sql):

• sql: SQL query to be executed or a table name.

• con: SQLAlchemy connectable, str, or sqlite3 connection
• Two main arguments (to_sql):

• name: Name of SQL table.

• con: SQLAlchemy connectable, str, or sqlite3 connection
• Let’s look at the other nice bits („additional arguments“)

• if_exists: „What should we do when the target already exists?“

• fail

• replace

• append
• index: „What should we do with this one magical column?“ (bool)

• index_label

• chunksize: „Write less data at once“ (batch size)

• dtype: „Which SQL type should each column get?“ (dict/scalar)

• method: „Supply some magic insertion hook“ (callable)
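Putting these arguments together, a minimal round trip might look like this (table and column names are made up for illustration):

    import sqlite3

    import pandas as pd

    con = sqlite3.connect("example.db")  # any supported connection works here

    # Write a DataFrame to a table, replacing it if it already exists.
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    df.to_sql("measurements", con, if_exists="replace", index=False)

    # Read it back with a plain SQL query.
    roundtrip = pd.read_sql("SELECT id, value FROM measurements", con)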
SQLAlchemy
• SQLAlchemy is a Python SQL toolkit and Object Relational Mapper
(ORM)

• We only use the toolkit part for:

• Metadata about schema and tables (incl. creation)

• Engine for connecting to various databases using a uniform
interface
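For example, once an Engine exists, the same pandas call works against any supported database (the connection string below is a placeholder):

    import pandas as pd
    import sqlalchemy

    # The Engine hides the database-specific driver behind a uniform interface.
    engine = sqlalchemy.create_engine("postgresql://user:password@host:5432/dbname")

    df = pd.read_sql("SELECT * FROM some_table", engine)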
Under the bonnet
How does it work (read_sql)?
• pandas.read_sql [1] calls SQLDatabase.read_query [2]

• This then does the following (see the sketch below)



• Depending on whether a chunksize was given, this fetches all or
parts of the result
[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243
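The code screenshot omitted above boils down to executing the query and fetching row tuples; here is a self-contained approximation with a plain DBAPI cursor (not the pandas source itself):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, value REAL)")
    con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 0.1), (2, 0.2)])

    # In essence: execute, grab the column names, then fetch everything
    # (or chunk by chunk if a chunksize was given) as Python row tuples.
    cursor = con.execute("SELECT id, value FROM t")
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()  # every single value becomes a Python object here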
How does it work (read_sql)?
• Passes the data into the DataFrame.from_records constructor


• Optionally parses dates and sets an index
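Continuing the approximation above, the fetched rows end up in DataFrame.from_records, followed by the optional parse_dates / index_col handling:

    import pandas as pd

    rows = [(1, 0.1), (2, 0.2)]  # the row tuples fetched by the cursor
    columns = ["id", "value"]

    # from_records turns the per-row tuples back into columns.
    frame = pd.DataFrame.from_records(rows, columns=columns, coerce_float=True)

    # Afterwards, roughly: parse_dates via pd.to_datetime(...) on the named
    # columns, and index_col via frame.set_index(...).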
How does it work (to_sql)?
• This is trickier as we modify the database.

• to_sql [1] may need to create the target

• If the table does not exist, it will issue a CREATE TABLE [2]

• Afterwards, we INSERT [3] into the (new) table

• The insertion step is where we convert the DataFrame back into records [4]



[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393
[3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398
[4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747
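A self-contained approximation of that write path (simplified SQL, not the pandas source):

    import sqlite3

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    con = sqlite3.connect(":memory:")

    # 1. Create the target table if it does not exist yet (CREATE TABLE).
    con.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER, value REAL)")

    # 2. Convert the DataFrame back into a list of row tuples ("records").
    records = list(df.itertuples(index=False, name=None))

    # 3. INSERT the records via one parameterized statement, executed per row.
    con.executemany("INSERT INTO t (id, value) VALUES (?, ?)", records)
    con.commit()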
Why is it slow?
No benchmarks yet, theory first: every step described above goes through per-row Python tuples and per-cell Python objects. The driver materializes a Python object for every single value, from_records then rebuilds columns out of those rows, and to_sql turns the DataFrame back into row tuples before inserting. Nothing along this path is columnar or vectorized, so the conversion overhead dominates.
Thanks
Slides will come after PyData Global

Follow me on Twitter: @xhochy
How to get fast?
ODBC
• Open Database Connectivity (ODBC) is a standard API for accessing
databases

• Most databases provide an ODBC interface, some of them are
efficient

• Two popular Python libraries for that:

• https://github.com/mkleehammer/pyodbc

• https://github.com/blue-yonder/turbodbc
ODBC
Turbodbc has support for Apache Arrow: https://arrow.apache.org/
blog/2017/06/16/turbodbc-arrow/
ODBC
• With turbodbc + Arrow we get the following performance
improvements:

• 3-4x for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208

• 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390
Snowflake
• Turbodbc is a solution that retrofits performance

• Snowflake drivers already come with built-in speed

• Default response is JSON-based, BUT:

• The database server can answer directly with Arrow

• Client only needs the Arrow->pandas conversion (lightning fast⚡)

• Up to 10x faster, see https://www.snowflake.com/blog/fetching-
query-results-from-snowflake-just-got-a-lot-faster-with-apache-
arrow/
JDBC
• Blogged about this at: https://uwekorn.com/2019/11/17/fast-jdbc-
access-in-python-using-pyarrow-jvm.html

• Not as convenient yet, and read-only

• First, you need all your Java dependencies, incl. arrow-jdbc, on your
classpath

• Start the JVM, load the driver, and set up Arrow Java
JDBC
• Then:

• Fetch result using the Arrow Java JDBC adapter

• Use pyarrow.jvm to get a Python reference to the JVM memory

• Convert to pandas: 136x speedup!
Postgres
Not yet open-sourced, but this is how it works:
How do we get this
into pandas.read_sql?
API troubles
• pandas’ simple API: 



• turbodbc
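The two omitted code screenshots contrast roughly like this; the DSN is a placeholder, and fetchallarrow() is turbodbc's Arrow-backed fetch:

    import turbodbc

    # pandas' simple API (for comparison):
    #     df = pd.read_sql("SELECT * FROM some_table", con)

    # turbodbc with Arrow: fast, but a different, cursor-based API.
    connection = turbodbc.connect(dsn="my_dsn")  # placeholder DSN
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM some_table")
    df = cursor.fetchallarrow().to_pandas()  # pyarrow.Table -> pandas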

API troubles
• pandas’ simple API: 



• Snowflake
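Again only a sketch; fetch_pandas_all() is the Snowflake connector's Arrow-backed fetch, and the credentials are placeholders:

    import snowflake.connector

    # pandas' simple API (for comparison):
    #     df = pd.read_sql("SELECT * FROM some_table", con)

    # Snowflake connector: the result arrives as Arrow and converts directly.
    con = snowflake.connector.connect(account="my_account", user="me", password="...")
    cursor = con.cursor()
    cursor.execute("SELECT * FROM some_table")
    df = cursor.fetch_pandas_all()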

API troubles
• pandas’ simple API: 



• pyarrow.jvm + JDBC
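A condensed sketch of the JDBC path from the blog post; the classpath entries and JDBC URL are placeholders, and the Arrow Java adapter call is approximate:

    import jpype

    # Start the JVM with the JDBC driver and the Arrow Java libraries on the classpath.
    jpype.startJVM(classpath=["driver.jar", "arrow-jdbc.jar", "arrow-vector.jar"])

    import jpype.imports  # enables "from java... import ..." below
    import pyarrow.jvm
    from java.sql import DriverManager
    from org.apache.arrow.adapter.jdbc import JdbcToArrow
    from org.apache.arrow.memory import RootAllocator

    connection = DriverManager.getConnection("jdbc:...")  # placeholder URL
    result_set = connection.createStatement().executeQuery("SELECT * FROM some_table")

    # Fetch into Arrow Java memory (signature approximate, see the blog post),
    # then wrap that JVM memory in Python without copying and convert to pandas.
    root = JdbcToArrow.sqlToArrow(result_set, RootAllocator(2**30), None)
    df = pyarrow.jvm.record_batch(root).to_pandas()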

Building a better API
• We want to use pandas’ simple API but with the nice performance
benefits

• One idea: dispatching based on the connection class (see the sketch after this list)



• User doesn’t need to learn a new API

• Performance improvements come via optional packages
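A hypothetical sketch of such a dispatch layer; none of this is pandas API, and the type checks are simplified to module-name prefixes:

    import pandas as pd

    def read_sql_dispatched(sql, con, **kwargs):
        """Route to a faster reader based on the connection's type, if possible."""
        module = type(con).__module__

        if module.startswith("turbodbc"):
            cursor = con.cursor()
            cursor.execute(sql)
            return cursor.fetchallarrow().to_pandas()  # Arrow-based fast path

        if module.startswith("snowflake.connector"):
            cursor = con.cursor()
            cursor.execute(sql)
            return cursor.fetch_pandas_all()  # Arrow-based fast path

        # Fall back to the existing, universal (but slower) implementation.
        return pd.read_sql(sql, con, **kwargs)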

Building a better API
Alternative idea:
Building a better API
Discussion in https://github.com/pandas-dev/pandas/issues/36893
Thanks
Follow me on Twitter: @xhochy
