© 2017 Dremio Corporation @DremioHQ
Apache Arrow: In Theory, In Practice
Apache Arrow Meetup @ Enigma
November 1, 2017
Jacques Nadeau
© 2017 Dremio Corporation @DremioHQ
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
© 2017 Dremio Corporation @DremioHQ
Arrow In Theory
© 2017 Dremio Corporation @DremioHQ
The Apache Arrow Project
• Started Feb 17, 2016 (Apache top-level project)
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming
language
4. Support for both relational and complex data
as-is
Committers & Contributors from:
Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
© 2017 Dremio Corporation @DremioHQ
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU
characteristics
• Embeddable in execution engines, storage
layers, etc.
• Interoperable
© 2017 Dremio Corporation @DremioHQ
Arrow In-Memory Columnar Format
• Shredded Nested Data Structures
• Randomly Accessible
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
© 2017 Dremio Corporation @DremioHQ
High Performance Sharing & Interchange
Before:
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
© 2017 Dremio Corporation @DremioHQ
Common Processing Libraries (soon)
• High-performance canonical processing for Arrow data structures
– Sort
– Hash Table
– Dictionary encoding
– Predicate application & masking
• Multiple media and processing paradigms
– Memory, NVMe, 3D XPoint
– x86, GPU, many-core (Xeon Phi), etc.
© 2017 Dremio Corporation @DremioHQ
Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, Map, List
• Advanced
– Union (sparse & dense)
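As a rough illustration of how these types compose, the following is a minimal sketch using the Arrow Java schema classes (Schema, Field, FieldType, ArrowType); the field names are invented for the example.

import java.util.Arrays;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaSketch {
  public static void main(String[] args) {
    // Scalar fields: a UTF8 string and a signed 32-bit int.
    Field name = Field.nullable("name", ArrowType.Utf8.INSTANCE);
    Field age = Field.nullable("age", new ArrowType.Int(32, true));

    // Complex field: a List<Utf8>. A list field carries a single child
    // field describing its element type.
    Field phoneElement = Field.nullable("element", ArrowType.Utf8.INSTANCE);
    Field phones = new Field("phones",
        FieldType.nullable(ArrowType.List.INSTANCE),
        Arrays.asList(phoneElement));

    Schema schema = new Schema(Arrays.asList(name, age, phones));
    System.out.println(schema);
  }
}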
© 2017 Dremio Corporation @DremioHQ
Common Message Pattern
• Schema Negotiation
– Logical Description of structure
– Identification of dictionary-encoded nodes
• Dictionary Batch
– Dictionary ID, Values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
(diagram) Message flow: Schema Negotiation, then Dictionary Batches, then Record Batches (groups of 1..N and 0..N batches). A stream-writer sketch follows.
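A hedged sketch of this message pattern using the Arrow Java stream writer (ArrowStreamWriter). It assumes a populated VectorSchemaRoot named root; passing a null DictionaryProvider assumes there are no dictionary-encoded fields, so no dictionary batches are emitted.

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

public class StreamWriteSketch {
  // Writes the schema, then one record batch per refill of `root`.
  static void writeOneBatch(VectorSchemaRoot root, String path) throws IOException {
    try (FileOutputStream out = new FileOutputStream(path);
         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out.getChannel())) {
      writer.start();       // schema message
      writer.writeBatch();  // record batch built from root's current contents
      // refill root and call writeBatch() again for additional batches
      writer.end();         // end-of-stream marker
    }
  }
}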
© 2017 Dremio Corporation @DremioHQ
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
© 2017 Dremio Corporation @DremioHQ
Record Batch Construction
(diagram) For the record below, the batch contains these vectors:
• data header (describes offsets into data)
• name: bitmap, offset, data
• age: bitmap, data
• phones: bitmap, list offset, offset, data
{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
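To make the construction concrete, here is a minimal sketch that populates the name and age vectors for the example records with the Arrow Java library (current class names such as VarCharVector and IntVector). The phones list would use a ListVector with a list writer and is omitted for brevity.

import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

public class RecordBatchSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VarCharVector name = new VarCharVector("name", allocator);
         IntVector age = new IntVector("age", allocator)) {

      name.allocateNew(2);
      name.setSafe(0, "Joe".getBytes(StandardCharsets.UTF_8));
      name.setSafe(1, "Jack".getBytes(StandardCharsets.UTF_8));
      name.setValueCount(2);   // validity bitmap + offsets + data buffers filled

      age.allocateNew(2);
      age.setSafe(0, 18);
      age.setSafe(1, 37);
      age.setValueCount(2);

      // Each vector owns contiguous off-heap buffers (validity, offsets, data).
    }
  }
}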
© 2017 Dremio Corporation @DremioHQ
Arrow Components
• Core Libraries
• Within Project Integrations
• Extended Integrations
© 2017 Dremio Corporation @DremioHQ
Arrow: Core Components
• Java Library
• C++ Library
• C Library
• Ruby Library
• Python Library
• JavaScript Library
© 2017 Dremio Corporation @DremioHQ
In-Project Arrow Building Blocks/Applications
• Plasma:
– Shared memory caching layer, originally created in Ray
• Feather:
– Fast ephemeral format for movement of data between
R/Python
• ArrowRest (soon):
– RPC/IPC interchange library (active development)
• ArrowRoutines (soon):
– Common data manipulation components
© 2017 Dremio Corporation @DremioHQ
Arrow Integrations
• Pandas
– Move seamlessly to/from Arrow as a means for communication, serialization, and
fast processing
• GOAI (GPU Open Analytics Initiative), libgdf and the GPU dataframe
– Leverages Arrow as internal representation
• Parquet
– Read and write Arrow data quickly to/from Parquet. The C++ library builds directly on
Arrow.
• Spark
– Supports conversion to Pandas via Arrow construction using Arrow Java Library
• Dremio
– OSS project, Sabot Engine executes entirely on Arrow memory
© 2017 Dremio Corporation @DremioHQ
Arrow In Practice
© 2017 Dremio Corporation @DremioHQ
Real World Arrow: Sabot
• Dremio is an OSS data fabric
product
• The core engine is “Sabot”
– Built entirely on top of the Arrow
libraries, runs in the JVM
© 2017 Dremio Corporation @DremioHQ
Sabot: Arrow in Practice
• Memory Management
• Vector sizing
• RPC Communication
• Filtering/Sorting
• Row-wise algorithms: Hash Tables
• Vector-wise Algorithms
– Aggregation
– Unnesting
© 2017 Dremio Corporation @DremioHQ
Practice: Memory Management
• Arrow includes a chunk-based managed allocator
– Built on top of Netty's jemalloc-style pooled allocator
• Create a tree of allocators
– Support both reservation and local limits
– Include leak detection, debug ownership logs and location accounting
• Size allocators (reservation and maximum) based on workload
management, when to trigger spilling, etc.
• All Arrow Vectors hold one or more off-heap buffers
• Everything is manually reference managed
– Some code more complex
– Provides strong memory availability understanding
(diagram) Example allocator tree: Root (res 0, max 20g) → Job 1 and Job 2 (res 10m, max 1g each) → Tasks (res 1m-5m, max 20m or unlimited); an IntVector at a leaf holds validity and data buffers. A sketch of building such a tree follows.
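A minimal sketch of such an allocator tree using the Arrow Java allocator API (RootAllocator, newChildAllocator); names and limits are illustrative, and class locations have shifted across Arrow Java versions.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class AllocatorTreeSketch {
  public static void main(String[] args) {
    // Root with a 20 GB limit; children get a reservation and a local maximum.
    try (BufferAllocator root = new RootAllocator(20L * 1024 * 1024 * 1024);
         BufferAllocator job1 = root.newChildAllocator("job-1", 10_000_000, 1L << 30);
         BufferAllocator task1 = job1.newChildAllocator("task-1", 1_000_000, Long.MAX_VALUE);
         IntVector ints = new IntVector("ints", task1)) {

      // Off-heap validity + data buffers, accounted against task-1, job-1, and root.
      ints.allocateNew(1024);
      System.out.println(task1.getAllocatedMemory()); // bytes charged to this allocator
    } // closing an allocator with outstanding buffers reports a leak
  }
}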
© 2017 Dremio Corporation @DremioHQ
Practice: Memory Management Cont’d
• Data moves through data pipelines
• Ownership needs to be clear (to
plan and control execution)
– Allocated memory can be referenced
by many consumers
– One allocator ‘owns’ the accounted
memory
– Consumers can use a Vector's transfer
capability to hand off data ownership (sketch below)
https://goo.gl/HN9nCH
(diagram) A Scan operator (res 10m, max 1g) transfers ownership of its output buffers to the downstream Aggregate operator (res 10m, max 1g).
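A hedged sketch of the hand-off using Arrow Java's TransferPair; the operator and vector names are illustrative.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.util.TransferPair;

public class TransferSketch {
  public static void main(String[] args) {
    try (BufferAllocator root = new RootAllocator(1L << 30);
         BufferAllocator scanAlloc = root.newChildAllocator("scan", 0, 1L << 28);
         BufferAllocator aggAlloc = root.newChildAllocator("aggregate", 0, 1L << 28);
         IntVector scanOutput = new IntVector("values", scanAlloc)) {

      scanOutput.allocateNew(4);
      for (int i = 0; i < 4; i++) {
        scanOutput.setSafe(i, i * 10);
      }
      scanOutput.setValueCount(4);

      // Create a target vector accounted against the aggregate's allocator and
      // move the underlying buffers without copying the data.
      TransferPair pair = scanOutput.getTransferPair(aggAlloc);
      pair.transfer();
      try (IntVector aggInput = (IntVector) pair.getTo()) {
        // aggInput now owns the buffers; scanOutput has been cleared.
      }
    }
  }
}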
© 2017 Dremio Corporation @DremioHQ
Practice: Vector Sizing
• Batches are the smallest work unit
• Batches of records can be 1..64k
records in size.
• Optimization Problem
– Larger batches improve processing
performance
– But larger batches cause pipeline problems
– Smaller batches cause more heap overhead
• Execution-Level Adaptive Resizing for
wide records (100-1000s fields)
(diagram) Example: a narrow batch holds 4095 records; a wide batch holds 127 records.
© 2017 Dremio Corporation @DremioHQ
Practice: RPC Communication
• Goals
– Leverage Gathering Writes
– Ensure connection resilience despite
memory pressure
• Custom Netty-based RPC protocol
– All messages include a structured
(proto) part and sidecar memory buffers
– Out-of-memory at message
consumption time produces a fail-ack
rather than a connection disconnect
Send signature: Listener listener, Proto structuredMessage, ArrowBuf... dataBodies
https://goo.gl/XWyrc1
(diagram) The structured message and the Arrow buffer sidecars go out in a single gathering write (sketch below).
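A hedged sketch of the gathering-write idea using Netty's CompositeByteBuf. The method and variable names are hypothetical, and the sidecar bodies are shown as plain byte arrays rather than ArrowBufs.

import java.util.List;

import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;

public class GatheringWriteSketch {
  // Frame = structured (protobuf-encoded) header followed by raw data bodies.
  // Composing them avoids copying the bodies into one contiguous buffer; the
  // socket write gathers the components.
  static void send(ChannelHandlerContext ctx, byte[] structuredMessage, List<byte[]> dataBodies) {
    CompositeByteBuf frame = ctx.alloc().compositeBuffer();
    frame.addComponent(true, Unpooled.wrappedBuffer(structuredMessage));
    for (byte[] body : dataBodies) {
      frame.addComponent(true, Unpooled.wrappedBuffer(body));
    }
    ctx.writeAndFlush(frame);
  }
}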
© 2017 Dremio Corporation @DremioHQ
Filtering & Sorting
• For filtering and sorting, create a selection
vector
– Describes valid values and ordering without
reorganizing underlying data.
– Two bytes for filter purposes (single batch
horizon)
– Four bytes for sort purposes (multi-batch
horizon)
• The 4-byte selection vector pattern is used
frequently by other operations
• A 6-byte selection vector is used in some cases
(to manage wide batches)
• Defer copying/compacting
(diagram) An sv2 holds 2-byte in-batch offsets (2, 14, 35, 99); an sv4 holds 4-byte batch-and-offset pairs (1-2, 2-14, 1-35, 2-99). A toy sv2 sketch follows.
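A toy illustration of the two-byte selection-vector idea in plain Java (Dremio's actual SelectionVector2 lives in its own codebase; this only shows the concept): filtering records 16-bit offsets instead of compacting the data.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class SelectionVectorSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(1 << 20);
         IntVector values = new IntVector("values", allocator)) {
      int[] data = {5, 42, 7, 99, 13, 64};
      values.allocateNew(data.length);
      for (int i = 0; i < data.length; i++) {
        values.setSafe(i, data[i]);
      }
      values.setValueCount(data.length);

      // "sv2": 16-bit offsets of rows passing the predicate (value > 10),
      // recorded without touching the underlying data buffers.
      short[] sv2 = new short[data.length];
      int selected = 0;
      for (int i = 0; i < data.length; i++) {
        if (values.get(i) > 10) {
          sv2[selected++] = (short) i;
        }
      }

      // Downstream operators read through the selection vector.
      for (int i = 0; i < selected; i++) {
        System.out.println(values.get(sv2[i]));
      }
    }
  }
}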
© 2017 Dremio Corporation @DremioHQ
Row-wise Algorithms: Hash Table + Aggregation
When building a hash table, keeping the keys in
columnar form slows hash insertion and lookup
• Break data into fixed and variable values
• Use consistent fixed value insertion
• Use dynamic variable output
• Pivot data into row-wise blocks
– One vector at a time for fixed-width values
– All variable-width vectors at the same time
• Hash and equality as bucket of bytes
• Avoids excessive indirection
• Maintain Aggregation tables in columnar
format
(diagram) Keys are pivoted into a Fixed Block Vector (rows of validity|fixed1|fixed2|varlen|varoffset) and a Variable Block Vector (runs of len|data pairs); unpivot and direct projection connect them to columnar partial-aggregation tables.
© 2017 Dremio Corporation @DremioHQ
Example Pivot Code
• Takes advantage of runs of
nullable values, working a
word at a time
– ALL_SET, NONE_SET, SOME_SET
• Ensure canonicalization of
values based on validity
– Typically validity data is zeroed
on allocation, other vectors are
not.
– Vector data has to be cleared
when pivoting nulled values
• Per-value conditionals are avoided
static void pivot8Bytes(
VectorPivotDef def,
FixedBlockVector fixedBlock,
final int count
){
...
// decode word at a time.
while (srcDataAddr < finalWordAddr) {
final long bitValues = PlatformDependent.getLong(srcBitsAddr);
if (bitValues == NONE_SET) {
// noop (all nulls).
bitTargetAddr += (WORD_BITS * blockLength);
valueTargetAddr += (WORD_BITS * blockLength);
srcDataAddr += (WORD_BITS * EIGHT_BYTE);
} else if (bitValues == ALL_SET) {
// all set, set the bit values using a constant OR. Independently set the data values without transformation.
final int bitVal = 1 << bitOffset;
for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength) {
PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | bitVal);
}
for (int i = 0; i < WORD_BITS; i++, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr));
}
} else {
// some nulls, some not, update each value to zero or the value, depending on the null bit.
for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
final int bitVal = ((int) (bitValues >>> i)) & 1;
PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | (bitVal << bitOffset));
PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr) * bitVal);
}
}
srcBitsAddr += WORD_BYTES;
}
https://goo.gl/EgLy9r
© 2017 Dremio Corporation @DremioHQ
Practice: Parallel Columnar Shuffle
• Partition data based on a hashed key
• Avoid excessive batch buffering cost
• Steps
1. Consolidate node-local streams
• Allows reducing buffering memory in large
clusters (k*n streams instead of n*n)
2. Hash the key(s) to determine bucket offset
• Generate bucket vector
3. Pre-allocate output buffers at target output
size
• Sized depending on narrow/wide batches
4. Do columnar copies per vector (see the sketch below)
• Written in a C-like, low-overhead pattern with
no abstraction
(diagram) On Node 1, each thread generates a bucket vector and performs bucket-level copies; the muxed node-local streams are sent to Node 2's threads with a gathering write.
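A toy sketch, in plain Java with hypothetical names, of steps 2-4: hash keys into a bucket vector, pre-allocate per-bucket output, then copy a value vector bucket by bucket.

import java.util.Arrays;

public class ShuffleBucketSketch {
  public static void main(String[] args) {
    int numBuckets = 4;
    long[] keys = {101, 202, 303, 404, 505, 606, 707, 808};
    long[] values = {1, 2, 3, 4, 5, 6, 7, 8};

    // Step 2: hash each key to a bucket and record it in a "bucket vector".
    int[] bucketOf = new int[keys.length];
    int[] bucketSize = new int[numBuckets];
    for (int row = 0; row < keys.length; row++) {
      int bucket = (Long.hashCode(keys[row]) & Integer.MAX_VALUE) % numBuckets;
      bucketOf[row] = bucket;
      bucketSize[bucket]++;
    }

    // Step 3: pre-allocate one output buffer per bucket at its target size.
    long[][] output = new long[numBuckets][];
    int[] writePos = new int[numBuckets];
    for (int b = 0; b < numBuckets; b++) {
      output[b] = new long[bucketSize[b]];
    }

    // Step 4: columnar copy of the value vector, driven by the bucket vector.
    for (int row = 0; row < values.length; row++) {
      int b = bucketOf[row];
      output[b][writePos[b]++] = values[row];
    }

    for (int b = 0; b < numBuckets; b++) {
      System.out.println("bucket " + b + ": " + Arrays.toString(output[b]));
    }
  }
}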
© 2017 Dremio Corporation @DremioHQ
Example Copier Code
• Two byte offset
addresses (sv2)
• Tight loop focused on raw memory copies
• Far more efficient than
runtime-generated row-
wise code
– Also has faster startup
time
public void copy(long offsetAddr, int count) {
final List<ArrowBuf> sourceBuffers = source.getFieldBuffers();
targetAlt.allocateNew(count);
final List<ArrowBuf> targetBuffers = target.getFieldBuffers();
final long max = offsetAddr + count * STEP_SIZE;
final long srcAddr = sourceBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
long dstAddr = targetBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
for(long addr = offsetAddr; addr < max; addr += STEP_SIZE, dstAddr += SIZE){
PlatformDependent.putLong(dstAddr,
PlatformDependent.getLong(srcAddr + ((char) PlatformDependent.getShort(addr)) * SIZE));
}
}
https://goo.gl/fZEsfy
© 2017 Dremio Corporation @DremioHQ
Unnesting List Vectors
• Common Pattern: List of objects that want to be
unrolled to separate records.
• Arrow’s representation allows a direct unroll (no
inner data copies required)
• Since leaf vectors can be larger (up to 2B), may
need to split apart inner vectors
– Makes use of SplitAndTransfer necessary
– Keep SplitAndTransfer as cheap as possible (sketch below)
• Noop for fixed-width data
• Offset rewrite for variable-width vectors, noop for the variable
data itself
• Bit rewrite & shifting for validity vectors
(diagram) A List Vector is an offset vector over an inner Struct Vector and its inner vectors.
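A hedged sketch of splitting an inner vector with Arrow Java's TransferPair.splitAndTransfer; the contents and split range are illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.util.TransferPair;

public class SplitAndTransferSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(1 << 20);
         VarCharVector phones = new VarCharVector("phones", allocator)) {
      String[] data = {"555-111-1111", "555-222-2222", "555-333-3333", "555-444-4444"};
      phones.allocateNew(data.length);
      for (int i = 0; i < data.length; i++) {
        phones.setSafe(i, data[i].getBytes(StandardCharsets.UTF_8));
      }
      phones.setValueCount(data.length);

      // Slice out values [1, 3): rewrites the offsets but does not copy the
      // underlying variable-width data buffer.
      TransferPair pair = phones.getTransferPair(allocator);
      pair.splitAndTransfer(1, 2);
      try (VarCharVector slice = (VarCharVector) pair.getTo()) {
        System.out.println(new String(slice.get(0), StandardCharsets.UTF_8)); // 555-222-2222
      }
    }
  }
}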
© 2017 Dremio Corporation @DremioHQ
What’s Coming
• Arrow RPC/REST
– Generic way to retrieve data in Arrow format
– Generic way to serve data in Arrow format
– Simplify integrations across the ecosystem
• Arrow Routines
– GPU and LLVM
© 2017 Dremio Corporation @DremioHQ
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack:
• https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ, @intjesus