© 2017 Dremio Corporation @DremioHQ
Efficient data formats for analytics with Parquet and Arrow
Julien Le Dem, Principal Architect Dremio, VP Apache Parquet, Apache Arrow PMC
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms.
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet
Julien Le Dem
@J_ Julien
Agenda
• Community Driven Standard
• Interoperability and Ecosystem
• Benefits of Columnar representation
– On disk (Apache Parquet)
– In memory (Apache Arrow)
• Future of columnar
Community Driven Standard
An open source standard
• Parquet: Common need for on disk columnar.
• Arrow: Common need for in memory columnar.
• Arrow building on the success of Parquet.
• Benefits:
– Share the effort
– Create an ecosystem
• Standard from the start
The Apache Arrow Project
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!
– Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
Interoperability and Ecosystem
High Performance Sharing & Interchange
Before
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. Parquet-to-Arrow reader)
Benefits of Columnar formats
@EmrgencyKittens
Columnar layout
Logical table
representation
Row layout
Column layout
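The difference between the two layouts can be sketched in a few lines of plain Python. The table contents and the function names `row_layout` and `column_layout` are illustrative assumptions, not part of the talk.

```python
# Hypothetical 3-column table; names and values are illustrative only.
rows = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]


def row_layout(rows):
    """Flatten row by row: all fields of one record are adjacent."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Flatten column by column: all values of one column are adjacent."""
    return [value for column in zip(*rows) for value in column]


row_major = row_layout(rows)
col_major = column_layout(rows)
```

In the column layout, a query that only needs column `a` touches one contiguous run of values instead of skipping over `b` and `c` in every record.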
On Disk and in Memory
• Different trade-offs
– On disk: Storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly Streaming access.
– In memory: Transient.
• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and Random access.
Parquet on disk columnar format
• Nested data structures
• Compact format:
– type aware encodings
– better compression
• Optimized I/O:
– Projection push down (column pruning)
– Predicate push down (filters based on stats)
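As a rough illustration of why type-aware encodings give a compact format, the sketch below applies dictionary encoding followed by run-length encoding to one column. It is a simplified model in plain Python, not Parquet's actual byte format; the `country` column and both function names are assumptions made for the example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# Hypothetical low-cardinality column.
country = ["US", "US", "US", "FR", "FR", "US"]
dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)
```

Because all values of a column share one type and are stored together, repeated values sit next to each other and encode into a handful of small integers.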
Access only the data you need
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
[Figure: the table above appears three times, labeled "Columnar" + "Statistics" = "Read only the data you need!"]
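A minimal sketch of how column pruning and statistics combine, assuming a hypothetical file split into two row groups with per-column min/max statistics. The data, the `scan` function and its parameters are invented for illustration and do not reflect any Parquet reader API.

```python
# Hypothetical file: two row groups, each storing its columns separately.
row_groups = [
    {"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]},
    {"a": [7, 8, 9], "b": [70, 80, 90], "c": ["u", "v", "w"]},
]
# Min/max statistics per column and per row group, as kept in file metadata.
stats = [
    {name: (min(col), max(col)) for name, col in group.items()}
    for group in row_groups
]


def scan(columns, filter_column, lower_bound):
    """Return the requested columns for rows where filter_column >= lower_bound."""
    result = {name: [] for name in columns}
    groups_read = 0
    for group, group_stats in zip(row_groups, stats):
        _, col_max = group_stats[filter_column]
        if col_max < lower_bound:
            continue  # predicate push down: skip the whole row group
        groups_read += 1
        keep = [i for i, v in enumerate(group[filter_column]) if v >= lower_bound]
        for name in columns:  # projection push down: touch requested columns only
            result[name].extend(group[name][i] for i in keep)
    return result, groups_read


result, groups_read = scan(columns=["a", "b"], filter_column="a", lower_bound=8)
```

Here column `c` is never read, and the first row group is skipped entirely because its maximum value of `a` is below the bound.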
Parquet nested representation
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
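The mapping from the nested Document schema to the flat column list can be sketched as a depth-first walk. The dictionary representation of the schema and the helper `leaf_columns` are assumptions for illustration. The sketch also leaves out the repetition and definition levels that the Dremel approach stores alongside each column to rebuild the nesting.

```python
# The Document schema from the slide as a nested dict; None marks a leaf field.
schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {
        "Language": {"Code": None, "Country": None},
        "Url": None,
    },
}


def leaf_columns(node, prefix=()):
    """Yield one dotted, lower-cased path per leaf field, depth first."""
    for name, child in node.items():
        path = prefix + (name.lower(),)
        if child is None:
            yield ".".join(path)
        else:
            yield from leaf_columns(child, path)


columns = list(leaf_columns(schema))
```

Each leaf of the schema tree becomes one column, so a query on `name.url` does not need to read `links.forward`.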
Arrow in memory columnar format
Arrow goals
• Well-documented and cross language compatible
• Designed to take advantage of modern CPU
• Embeddable
– in execution engines, storage layers, etc.
• Interoperable
Arrow in memory columnar format
• Nested Data Structures
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
CPU pipeline
Minimize CPU cache misses
A cache miss costs tens to hundreds of cycles, depending on the cache level
Summary: Focus on CPU Efficiency
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant-time value access
– With minimal structure overhead
• Operate directly on columnar data
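A small sketch of what "minimal structure overhead" can look like for a nullable integer column: one validity bitmap plus one contiguous fixed-width buffer. It uses only the Python standard library and is an assumption-laden model, not the Arrow library; `build_int_vector` and `get` are hypothetical names.

```python
from array import array


def build_int_vector(values):
    """Pack optional ints into a validity bitmap plus a contiguous data buffer."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = array("q", [0] * len(values))  # 8-byte signed ints, stored contiguously
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
            data[i] = v
    return bitmap, data


def get(bitmap, data, i):
    """Constant-time access: one bit test and one indexed load."""
    if bitmap[i // 8] & (1 << (i % 8)):
        return data[i]
    return None


bitmap, data = build_int_vector([18, None, 37])
# Operate directly on the columnar buffers: sum only the valid slots.
total = sum(v for i, v in enumerate(data) if bitmap[i // 8] & (1 << (i % 8)))
```

A loop over `data` walks memory sequentially, which is the access pattern that favors cache locality and vectorized execution in a compiled engine.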
Arrow Messages, RPC & IPC
Common Message Pattern
• Schema Negotiation
– Logical Description of structure
– Identification of dictionary encoded nodes
• Dictionary Batch
– Dictionary ID, Values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
[Figure: message sequence: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch; annotated "1..N Batches" and "0..N Batches"]
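The message pattern can be modeled as an ordered stream: a schema, then a dictionary, then the record batches. The classes below, the single `city` field and the `to_messages` helper are hypothetical and only mirror the bullet points of the slide. They are not Arrow's message definitions.

```python
from dataclasses import dataclass

MAX_RECORDS_PER_BATCH = 64 * 1024  # "up to 64K" from the slide


@dataclass
class Schema:
    fields: list             # logical description, e.g. [("city", "string")]
    dictionary_fields: dict  # field name -> dictionary ID


@dataclass
class DictionaryBatch:
    dictionary_id: int
    values: list


@dataclass
class RecordBatch:
    columns: dict  # field name -> list of dictionary indices


def to_messages(city_values, batch_size=MAX_RECORDS_PER_BATCH):
    """Emit the schema, then the dictionary, then the record batches."""
    dictionary = sorted(set(city_values))
    index_of = {v: i for i, v in enumerate(dictionary)}
    indices = [index_of[v] for v in city_values]
    messages = [
        Schema(fields=[("city", "string")], dictionary_fields={"city": 0}),
        DictionaryBatch(dictionary_id=0, values=dictionary),
    ]
    for start in range(0, len(indices), batch_size):
        messages.append(RecordBatch({"city": indices[start:start + batch_size]}))
    return messages


# A tiny batch size is used here only to force several record batches.
messages = to_messages(["NYC", "SF", "NYC", "SF", "SF"], batch_size=2)
```

The receiver learns from the schema which fields are dictionary encoded, so the record batches can carry small indices instead of repeated strings.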
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
Record Batch Construction
[Figure: message sequence: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch]
{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}
data header (describes offsets into data)
name (bitmap)
name (offset)
name (data)
age (bitmap)
age (data)
phones (bitmap)
phones (list offset)
phones (offset)
phones (data)
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
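The vectors named on the slide can be built by hand for the `persons` example. This is a standard-library sketch under stated assumptions (little-endian 32-bit ages, UTF-8 strings, least-significant-bit-first bitmaps, offsets kept as Python lists), not the Arrow implementation. The helpers `validity_bitmap` and `string_vector` are hypothetical.

```python
import struct

persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]


def validity_bitmap(values):
    """One bit per slot, set when the value is present (LSB first)."""
    bits = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


def string_vector(strings):
    """Offsets (n + 1 entries) plus one contiguous UTF-8 data buffer."""
    offsets, data = [0], b""
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, data


names = [p["name"] for p in persons]
ages = [p["age"] for p in persons]
phone_lists = [p["phones"] for p in persons]

name_offsets, name_data = string_vector(names)
age_data = struct.pack("<%di" % len(ages), *ages)  # little-endian int32 values

# The list offsets record where each person's phones start and end;
# the child string vector stores all phone numbers back to back.
list_offsets = [0]
for phones in phone_lists:
    list_offsets.append(list_offsets[-1] + len(phones))
all_phones = [phone for phones in phone_lists for phone in phones]
phone_offsets, phone_data = string_vector(all_phones)

buffers = {
    "name (bitmap)": validity_bitmap(names),
    "name (offset)": name_offsets,
    "name (data)": name_data,
    "age (bitmap)": validity_bitmap(ages),
    "age (data)": age_data,
    "phones (bitmap)": validity_bitmap(phone_lists),
    "phones (list offset)": list_offsets,
    "phones (offset)": phone_offsets,
    "phones (data)": phone_data,
}
```

Reading Jack's phones needs only two lookups: `list_offsets[1:3]` gives the range of child entries, and `phone_offsets` gives the byte range of each string inside `phone_data`.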
Moving Data Between Systems
RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored I/O
– Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory mapped files
– Moving data between Python and Drill
• Working on shared allocation approach
– Shared reference counting and well-defined ownership semantics
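The memory-mapped-file idea can be illustrated with the standard library alone: one side writes the in-memory bytes of a column unchanged, the other side maps the file and views it as typed values without a parsing step. This is a sketch under assumptions (a single int64 column, both sides on the same machine and byte order, a temporary file standing in for the shared region) and not the alpha implementation mentioned above.

```python
import mmap
import os
import tempfile
from array import array

values = array("q", [18, 37, 42])  # 8-byte signed integers

# "Writer" side: dump the in-memory bytes as they are.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(values.tobytes())

# "Reader" side: map the file and view it as int64 values without parsing.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    with memoryview(mapped) as raw, raw.cast("q") as view:
        read_back = list(view)  # copied into a list only for display
    mapped.close()
os.remove(path)
```

The reader never runs a deserializer: the typed view points at the mapped bytes, which is only possible because both sides agree on the memory layout.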
RPC: Single system execution
The memory representation is sent over the wire. No serialization overhead.
[Figure: Parquet files → three Scanners (projection push down: read only a and b) → three Partial Agg → Shuffle (Arrow batches) → three Agg → Result]
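The dataflow in this diagram (scan, partial aggregation, shuffle, final aggregation) can be sketched on column-wise batches. The batch contents, the sum aggregate and the byte-sum partitioning function are assumptions for illustration. A real engine would ship the batches between nodes in their in-memory form.

```python
from collections import defaultdict

# Hypothetical scanner output: each batch holds only columns a and b
# (projection push down) and stores them column-wise.
batches = [
    {"a": ["x", "y", "x"], "b": [1, 2, 3]},
    {"a": ["y", "y", "z"], "b": [4, 5, 6]},
    {"a": ["x", "z"], "b": [7, 8]},
]


def partial_agg(batch):
    """Sum b per key a within a single batch."""
    sums = defaultdict(int)
    for key, value in zip(batch["a"], batch["b"]):
        sums[key] += value
    return dict(sums)


def shuffle(partials, num_partitions):
    """Route each key to one partition so one final Agg sees all its partial sums."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            target = sum(key.encode("utf-8")) % num_partitions
            partitions[target].append((key, value))
    return partitions


def final_agg(partition):
    """Combine the partial sums that landed in one partition."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


partials = [partial_agg(batch) for batch in batches]
result = {}
for partition in shuffle(partials, num_partitions=3):
    result.update(final_agg(partition))
```

Partial aggregation shrinks the data before the shuffle, so less has to cross the wire between the two aggregation stages.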
Multi-system IPC
[Figure: a SQL engine (SQL Operator 1, SQL Operator 2) and a Python process (User defined function), with two "reads" labels]
Summary and Future
Language Bindings
Parquet
• Target Languages
– Java
– C++
– Python & Pandas
• Engines integration:
– Many!
Arrow
• Target Languages
– Java
– C++, Python
– R (underway)
– C, Ruby, JavaScript
• Engines integration:
– Drill
– Pandas, R
– Spark (underway)
Current activity:
• Spark Integration (SPARK-13534)
• Dictionary encoding (ARROW-542)
• Time related types finalization (ARROW-617)
• Bindings:
– C, Ruby (ARROW-631)
– JavaScript (ARROW-541)
Results
- PySpark Integration:
53x speedup (IBM Spark work on SPARK-13534)
http://s.apache.org/arrowresult1
- Streaming Arrow Performance
7.75GB/s data movement
http://s.apache.org/arrowresult2
- Arrow Parquet C++ Integration
4GB/s reads
http://s.apache.org/arrowresult3
- Pandas Integration
9.71GB/s
http://s.apache.org/arrowresult4
What’s Next
• Arrow RPC/REST
– Generic way to retrieve data in Arrow format
– Generic way to serve data in Arrow format
– Simplify integrations across the ecosystem
RPC: arrow based storage interchange
The memory representation is sent over the wire. No serialization overhead.
[Figure: SQL execution runs Scanner/Operator pairs; each Scanner receives Arrow batches, with projection/predicate push down, from a Storage node (Mem, Disk); three such pairs are shown, followed by "…"]
RPC: arrow based cache
The memory representation is sent over the wire. No serialization overhead.
[Figure: SQL execution Operators read, with projection push down, from an Arrow-based Cache; three Operators are shown, followed by "…"]
What’s Next
• Parquet – Arrow Nested support for Python & C++
• Arrow IPC Implementation
• Kudu – Arrow integration
• Apache {Spark, Drill} to Arrow Integration
– Faster UDFs, Storage interfaces
• Support for integration with Intel’s Persistent Memory library via Apache Mnemonic
Get Involved
• Join the community
– dev@{arrow,parquet}.apache.org
– Slack:
• https://apachearrowslackin.herokuapp.com/
– http://{arrow,parquet}.apache.org
– Follow @Apache{Parquet,Arrow}
