© 2017 Dremio Corporation @DremioHQ
Apache Arrow: In Theory, In Practice
Apache Arrow Meetup @ Enigma
November 1, 2017
Jacques Nadeau
© 2017 Dremio Corporation @DremioHQ
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
© 2017 Dremio Corporation @DremioHQ
Arrow In Theory
© 2017 Dremio Corporation @DremioHQ
The Apache Arrow Project
• Started Feb 17, 2016 (Apache top-level project)
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming
language
4. Support for both relational and complex data
as-is
Committers & Contributors from:
Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
© 2017 Dremio Corporation @DremioHQ
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU
characteristics
• Embeddable in execution engines, storage
layers, etc.
• Interoperable
© 2017 Dremio Corporation @DremioHQ
Arrow In-Memory Columnar Format
• Shredded Nested Data Structures
• Randomly Accessible
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
© 2017 Dremio Corporation @DremioHQ
High Performance Sharing & Interchange
Before:
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
© 2017 Dremio Corporation @DremioHQ
Common Processing Libraries (soon)
• High-performance canonical processing for Arrow data structures
– Sort
– Hash Table
– Dictionary encoding
– Predicate application & masking
• Multiple media and processing paradigms
– Memory, NVMe, 3D XPoint
– x86, GPU, many-core (Xeon Phi), etc.
© 2017 Dremio Corporation @DremioHQ
Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, Map, List
• Advanced
– Union (sparse & dense)
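As a rough illustration of how these types compose, the following is a minimal sketch using the Arrow Java schema classes (Schema, Field, FieldType, ArrowType); the field names are invented for the example.

import java.util.Arrays;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaSketch {
  public static void main(String[] args) {
    // Scalar fields: a UTF8 string and a signed 32-bit int.
    Field name = Field.nullable("name", ArrowType.Utf8.INSTANCE);
    Field age = Field.nullable("age", new ArrowType.Int(32, true));

    // Complex field: a List<Utf8>. A list field carries a single child
    // field describing its element type.
    Field phoneElement = Field.nullable("element", ArrowType.Utf8.INSTANCE);
    Field phones = new Field("phones",
        FieldType.nullable(ArrowType.List.INSTANCE),
        Arrays.asList(phoneElement));

    Schema schema = new Schema(Arrays.asList(name, age, phones));
    System.out.println(schema);
  }
}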
© 2017 Dremio Corporation @DremioHQ
Common Message Pattern
• Schema Negotiation
– Logical Description of structure
– Identification of dictionary-encoded nodes
• Dictionary Batch
– Dictionary ID, Values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
(diagram) Message flow: Schema Negotiation, then Dictionary Batches, then Record Batches (groups of 1..N and 0..N batches). A stream-writer sketch follows.
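A hedged sketch of this message pattern using the Arrow Java stream writer (ArrowStreamWriter). It assumes a populated VectorSchemaRoot named root; passing a null DictionaryProvider assumes there are no dictionary-encoded fields, so no dictionary batches are emitted.

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

public class StreamWriteSketch {
  // Writes the schema, then one record batch per refill of `root`.
  static void writeOneBatch(VectorSchemaRoot root, String path) throws IOException {
    try (FileOutputStream out = new FileOutputStream(path);
         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out.getChannel())) {
      writer.start();       // schema message
      writer.writeBatch();  // record batch built from root's current contents
      // refill root and call writeBatch() again for additional batches
      writer.end();         // end-of-stream marker
    }
  }
}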
© 2017 Dremio Corporation @DremioHQ
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
© 2017 Dremio Corporation @DremioHQ
Record Batch Construction
(diagram) For the record below, the batch contains these vectors:
• data header (describes offsets into data)
• name: bitmap, offset, data
• age: bitmap, data
• phones: bitmap, list offset, offset, data
{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
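To make the construction concrete, here is a minimal sketch that populates the name and age vectors for the example records with the Arrow Java library (current class names such as VarCharVector and IntVector). The phones list would use a ListVector with a list writer and is omitted for brevity.

import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

public class RecordBatchSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VarCharVector name = new VarCharVector("name", allocator);
         IntVector age = new IntVector("age", allocator)) {

      name.allocateNew(2);
      name.setSafe(0, "Joe".getBytes(StandardCharsets.UTF_8));
      name.setSafe(1, "Jack".getBytes(StandardCharsets.UTF_8));
      name.setValueCount(2);   // validity bitmap + offsets + data buffers filled

      age.allocateNew(2);
      age.setSafe(0, 18);
      age.setSafe(1, 37);
      age.setValueCount(2);

      // Each vector owns contiguous off-heap buffers (validity, offsets, data).
    }
  }
}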
© 2017 Dremio Corporation @DremioHQ
Arrow Components
• Core Libraries
• Within Project Integrations
• Extended Integrations
© 2017 Dremio Corporation @DremioHQ
Arrow: Core Components
• Java Library
• C++ Library
• C Library
• Ruby Library
• Python Library
• JavaScript Library
© 2017 Dremio Corporation @DremioHQ
In-Project Arrow Building Blocks/Applications
• Plasma:
– Shared memory caching layer, originally created in Ray
• Feather:
– Fast ephemeral format for movement of data between
R/Python
• ArrowRest (soon):
– RPC/IPC interchange library (active development)
• ArrowRoutines (soon):
– Common data manipulation components
© 2017 Dremio Corporation @DremioHQ
Arrow Integrations
• Pandas
– Move seamlessly to/from Arrow as a means for communication, serialization, and
fast processing
• GOAI (GPU Open Analytics Initiative), libgdf and the GPU dataframe
– Leverages Arrow as internal representation
• Parquet
– Read and write Arrow data quickly to/from Parquet. The C++ library builds directly on
Arrow.
• Spark
– Supports conversion to Pandas via Arrow construction using Arrow Java Library
• Dremio
– OSS project, Sabot Engine executes entirely on Arrow memory
© 2017 Dremio Corporation @DremioHQ
Arrow In Practice
© 2017 Dremio Corporation @DremioHQ
Real World Arrow: Sabot
• Dremio is an OSS data fabric
product
• The core engine is “Sabot”
– Built entirely on top of the Arrow
libraries, runs in the JVM
© 2017 Dremio Corporation @DremioHQ
Sabot: Arrow in Practice
• Memory Management
• Vector sizing
• RPC Communication
• Filtering/Sorting
• Row-wise algorithms: Hash Tables
• Vector-wise Algorithms
– Aggregation
– Unnesting
© 2017 Dremio Corporation @DremioHQ
Practice: Memory Management
• Arrow includes a chunk-based managed allocator
– Built on top of Netty's jemalloc-style pooled allocator
• Create a tree of allocators
– Support both reservation and local limits
– Include leak detection, debug ownership logs and location accounting
• Size allocators (reservation and maximum) based on workload
management, when to trigger spilling, etc.
• All Arrow Vectors hold one or more off-heap buffers
• Everything is manually reference managed
– Some code more complex
– Provides strong memory availability understanding
(diagram) Example allocator tree: Root (res 0, max 20g) → Job 1 and Job 2 (res 10m, max 1g each) → Tasks (res 1m-5m, max 20m or unlimited); an IntVector at a leaf holds validity and data buffers. A sketch of building such a tree follows.
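A minimal sketch of such an allocator tree using the Arrow Java allocator API (RootAllocator, newChildAllocator); names and limits are illustrative, and class locations have shifted across Arrow Java versions.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class AllocatorTreeSketch {
  public static void main(String[] args) {
    // Root with a 20 GB limit; children get a reservation and a local maximum.
    try (BufferAllocator root = new RootAllocator(20L * 1024 * 1024 * 1024);
         BufferAllocator job1 = root.newChildAllocator("job-1", 10_000_000, 1L << 30);
         BufferAllocator task1 = job1.newChildAllocator("task-1", 1_000_000, Long.MAX_VALUE);
         IntVector ints = new IntVector("ints", task1)) {

      // Off-heap validity + data buffers, accounted against task-1, job-1, and root.
      ints.allocateNew(1024);
      System.out.println(task1.getAllocatedMemory()); // bytes charged to this allocator
    } // closing an allocator with outstanding buffers reports a leak
  }
}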
© 2017 Dremio Corporation @DremioHQ
Practice: Memory Management Cont’d
• Data moves through data pipelines
• Ownership needs to be clear (to
plan and control execution)
– Allocated memory can be referenced
by many consumers
– One allocator ‘owns’ the accounted
memory
– Consumers can use a Vector's transfer
capability to hand off data ownership (sketch below)
https://goo.gl/HN9nCH
(diagram) A Scan operator (res 10m, max 1g) transfers ownership of its output buffers to the downstream Aggregate operator (res 10m, max 1g).
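A hedged sketch of the hand-off using Arrow Java's TransferPair; the operator and vector names are illustrative.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.util.TransferPair;

public class TransferSketch {
  public static void main(String[] args) {
    try (BufferAllocator root = new RootAllocator(1L << 30);
         BufferAllocator scanAlloc = root.newChildAllocator("scan", 0, 1L << 28);
         BufferAllocator aggAlloc = root.newChildAllocator("aggregate", 0, 1L << 28);
         IntVector scanOutput = new IntVector("values", scanAlloc)) {

      scanOutput.allocateNew(4);
      for (int i = 0; i < 4; i++) {
        scanOutput.setSafe(i, i * 10);
      }
      scanOutput.setValueCount(4);

      // Create a target vector accounted against the aggregate's allocator and
      // move the underlying buffers without copying the data.
      TransferPair pair = scanOutput.getTransferPair(aggAlloc);
      pair.transfer();
      try (IntVector aggInput = (IntVector) pair.getTo()) {
        // aggInput now owns the buffers; scanOutput has been cleared.
      }
    }
  }
}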
© 2017 Dremio Corporation @DremioHQ
Practice: Vector Sizing
• Batches are the smallest work unit
• Batches of records can be 1..64k
records in size.
• Optimization Problem
– Larger batches improve processing
performance
– But larger batches cause pipeline problems
– Smaller batches cause more heap overhead
• Execution-Level Adaptive Resizing for
wide records (100-1000s fields)
(diagram) Example: a narrow batch holds 4095 records; a wide batch holds 127 records.
© 2017 Dremio Corporation @DremioHQ
Practice: RPC Communication
• Goals
– Leverage Gathering Writes
– Ensure connection resilience despite
memory pressure
• Custom Netty-based RPC protocol
– All messages include a structured
(proto) part and sidecar memory buffers
– Out-of-memory at message
consumption time produces a fail-ack
rather than a connection disconnect
Send signature: Listener listener, Proto structuredMessage, ArrowBuf... dataBodies
https://goo.gl/XWyrc1
(diagram) The structured message and the Arrow buffer sidecars go out in a single gathering write (sketch below).
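A hedged sketch of the gathering-write idea using Netty's CompositeByteBuf. The method and variable names are hypothetical, and the sidecar bodies are shown as plain byte arrays rather than ArrowBufs.

import java.util.List;

import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;

public class GatheringWriteSketch {
  // Frame = structured (protobuf-encoded) header followed by raw data bodies.
  // Composing them avoids copying the bodies into one contiguous buffer; the
  // socket write gathers the components.
  static void send(ChannelHandlerContext ctx, byte[] structuredMessage, List<byte[]> dataBodies) {
    CompositeByteBuf frame = ctx.alloc().compositeBuffer();
    frame.addComponent(true, Unpooled.wrappedBuffer(structuredMessage));
    for (byte[] body : dataBodies) {
      frame.addComponent(true, Unpooled.wrappedBuffer(body));
    }
    ctx.writeAndFlush(frame);
  }
}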
© 2017 Dremio Corporation @DremioHQ
Filtering & Sorting
• For filtering and sorting, create a selection
vector
– Describes valid values and ordering without
reorganizing underlying data.
– Two bytes for filter purposes (single batch
horizon)
– Four bytes for sort purposes (multi-batch
horizon)
• The 4-byte selection vector pattern is used
frequently by other operations
• A 6-byte selection vector is used in some cases
(to manage wide batches)
• Defer copying/compacting
(diagram) An sv2 holds 2-byte in-batch offsets (2, 14, 35, 99); an sv4 holds 4-byte batch-and-offset pairs (1-2, 2-14, 1-35, 2-99). A toy sv2 sketch follows.
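A toy illustration of the two-byte selection-vector idea in plain Java (Dremio's actual SelectionVector2 lives in its own codebase; this only shows the concept): filtering records 16-bit offsets instead of compacting the data.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class SelectionVectorSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(1 << 20);
         IntVector values = new IntVector("values", allocator)) {
      int[] data = {5, 42, 7, 99, 13, 64};
      values.allocateNew(data.length);
      for (int i = 0; i < data.length; i++) {
        values.setSafe(i, data[i]);
      }
      values.setValueCount(data.length);

      // "sv2": 16-bit offsets of rows passing the predicate (value > 10),
      // recorded without touching the underlying data buffers.
      short[] sv2 = new short[data.length];
      int selected = 0;
      for (int i = 0; i < data.length; i++) {
        if (values.get(i) > 10) {
          sv2[selected++] = (short) i;
        }
      }

      // Downstream operators read through the selection vector.
      for (int i = 0; i < selected; i++) {
        System.out.println(values.get(sv2[i]));
      }
    }
  }
}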
© 2017 Dremio Corporation @DremioHQ
Row-wise Algorithms: Hash Table + Aggregation
When building a hash table, keeping the keys in
columnar form slows hash insertion and lookup
• Break data into fixed and variable values
• Use consistent fixed value insertion
• Use dynamic variable output
• Pivot data into row-wise blocks
– One vector at a time for fixed-width values
– All variable-width vectors at the same time
• Hash and equality as bucket of bytes
• Avoids excessive indirection
• Maintain Aggregation tables in columnar
format
(diagram) Keys are pivoted into a Fixed Block Vector (rows of validity|fixed1|fixed2|varlen|varoffset) and a Variable Block Vector (runs of len|data pairs); unpivot and direct projection connect them to columnar partial-aggregation tables.
© 2017 Dremio Corporation @DremioHQ
Example Pivot Code
• Takes advantage of runs of
nullable values, working a
word at a time
– ALL_SET, NONE_SET, SOME_SET
• Ensure canonicalization of
values based on validity
– Typically validity data is zeroed
on allocation, other vectors are
not.
– Vector data has to be cleared
when pivoting nulled values
• Per-value conditionals are avoided
static void pivot8Bytes(
VectorPivotDef def,
FixedBlockVector fixedBlock,
final int count
){
...
// decode word at a time.
while (srcDataAddr < finalWordAddr) {
final long bitValues = PlatformDependent.getLong(srcBitsAddr);
if (bitValues == NONE_SET) {
// noop (all nulls).
bitTargetAddr += (WORD_BITS * blockLength);
valueTargetAddr += (WORD_BITS * blockLength);
srcDataAddr += (WORD_BITS * EIGHT_BYTE);
} else if (bitValues == ALL_SET) {
// all set, set the bit values using a constant OR. Independently set the data values without transformation.
final int bitVal = 1 << bitOffset;
for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength) {
PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | bitVal);
}
for (int i = 0; i < WORD_BITS; i++, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr));
}
} else {
// some nulls, some not, update each value to zero or the value, depending on the null bit.
for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
final int bitVal = ((int) (bitValues >>> i)) & 1;
PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | (bitVal << bitOffset));
PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr) * bitVal);
}
}
srcBitsAddr += WORD_BYTES;
}
https://goo.gl/EgLy9r
© 2017 Dremio Corporation @DremioHQ
Practice: Parallel Columnar Shuffle
• Partition data based on a hashed key
• Avoid excessive batch buffering cost
• Steps
1. Consolidate node-local streams
• Allows reducing buffering memory in large
clusters (k*n streams instead of n*n)
2. Hash the key(s) to determine bucket offset
• Generate bucket vector
3. Pre-allocate output buffers at target output
size
• Sized depending on narrow/wide batches
4. Do columnar copies per vector (see the sketch below)
• Written in a C-like, low-overhead pattern with
no abstraction
(diagram) On Node 1, each thread generates a bucket vector and performs bucket-level copies; the muxed node-local streams are sent to Node 2's threads with a gathering write.
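A toy sketch, in plain Java with hypothetical names, of steps 2-4: hash keys into a bucket vector, pre-allocate per-bucket output, then copy a value vector bucket by bucket.

import java.util.Arrays;

public class ShuffleBucketSketch {
  public static void main(String[] args) {
    int numBuckets = 4;
    long[] keys = {101, 202, 303, 404, 505, 606, 707, 808};
    long[] values = {1, 2, 3, 4, 5, 6, 7, 8};

    // Step 2: hash each key to a bucket and record it in a "bucket vector".
    int[] bucketOf = new int[keys.length];
    int[] bucketSize = new int[numBuckets];
    for (int row = 0; row < keys.length; row++) {
      int bucket = (Long.hashCode(keys[row]) & Integer.MAX_VALUE) % numBuckets;
      bucketOf[row] = bucket;
      bucketSize[bucket]++;
    }

    // Step 3: pre-allocate one output buffer per bucket at its target size.
    long[][] output = new long[numBuckets][];
    int[] writePos = new int[numBuckets];
    for (int b = 0; b < numBuckets; b++) {
      output[b] = new long[bucketSize[b]];
    }

    // Step 4: columnar copy of the value vector, driven by the bucket vector.
    for (int row = 0; row < values.length; row++) {
      int b = bucketOf[row];
      output[b][writePos[b]++] = values[row];
    }

    for (int b = 0; b < numBuckets; b++) {
      System.out.println("bucket " + b + ": " + Arrays.toString(output[b]));
    }
  }
}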
© 2017 Dremio Corporation @DremioHQ
Example Copier Code
• Two byte offset
addresses (sv2)
• Tight loop focused on raw memory copies
• Far more efficient than
runtime-generated row-
wise code
– Also has faster startup
time
public void copy(long offsetAddr, int count) {
final List<ArrowBuf> sourceBuffers = source.getFieldBuffers();
targetAlt.allocateNew(count);
final List<ArrowBuf> targetBuffers = target.getFieldBuffers();
final long max = offsetAddr + count * STEP_SIZE;
final long srcAddr = sourceBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
long dstAddr = targetBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
for(long addr = offsetAddr; addr < max; addr += STEP_SIZE, dstAddr += SIZE){
PlatformDependent.putLong(dstAddr,
PlatformDependent.getLong(srcAddr + ((char) PlatformDependent.getShort(addr)) * SIZE));
}
}
https://goo.gl/fZEsfy
© 2017 Dremio Corporation @DremioHQ
Unnesting List Vectors
• Common Pattern: List of objects that want to be
unrolled to separate records.
• Arrow’s representation allows a direct unroll (no
inner data copies required)
• Since leaf vectors can be larger (up to 2B), may
need to split apart inner vectors
– Makes use of SplitAndTransfer necessary
– Keep SplitAndTransfer as cheap as possible (sketch below)
• Noop for fixed-width data
• Offset rewrite for variable-width vectors, noop for the variable
data itself
• Bit rewrite & shifting for validity vectors
(diagram) A List Vector is an offset vector over an inner Struct Vector and its inner vectors.
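A hedged sketch of splitting an inner vector with Arrow Java's TransferPair.splitAndTransfer; the contents and split range are illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.util.TransferPair;

public class SplitAndTransferSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(1 << 20);
         VarCharVector phones = new VarCharVector("phones", allocator)) {
      String[] data = {"555-111-1111", "555-222-2222", "555-333-3333", "555-444-4444"};
      phones.allocateNew(data.length);
      for (int i = 0; i < data.length; i++) {
        phones.setSafe(i, data[i].getBytes(StandardCharsets.UTF_8));
      }
      phones.setValueCount(data.length);

      // Slice out values [1, 3): rewrites the offsets but does not copy the
      // underlying variable-width data buffer.
      TransferPair pair = phones.getTransferPair(allocator);
      pair.splitAndTransfer(1, 2);
      try (VarCharVector slice = (VarCharVector) pair.getTo()) {
        System.out.println(new String(slice.get(0), StandardCharsets.UTF_8)); // 555-222-2222
      }
    }
  }
}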
© 2017 Dremio Corporation @DremioHQ
What’s Coming
• Arrow RPC/REST
– Generic way to retrieve data in Arrow format
– Generic way to serve data in Arrow format
– Simplify integrations across the ecosystem
• Arrow Routines
– GPU and LLVM
© 2017 Dremio Corporation @DremioHQ
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack:
• https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ, @intjesus