A Rusty Introduction to Apache
Arrow and how it Applies to a
Time Series Database
December 9, 2020
Andrew Lamb
InfluxData
IOx Team at InfluxData
Query Optimizer / Architect @ Vertica (Columnar Database)
Chief Architect @ DataRobot (Machine Learning Platform)
Chief Architect @ Nutonian (Machine Learning Apps)
XSLT JIT Compiler Team at DataPower
Goals + Outline
Goal: ⇒ Arrow is a good basis for a new (time series) database ❤
● Opinions and Perspectives of Databases
● Background on Arrow
● Arrow Examples, in Rust
Databases -- Trend Towards Specialization
● Data Model: Relational, Key-Value, Timeseries, Graph, Array / Scientific, Document, Stream
● Deployment: Embedded / Edge, Cloud, Single-Node, Hybrid
● Ecosystem: Hadoop, Java, JSON / JavaScript, AWS, GCP, Azure, Apple Cloud
● Use Case: Transactions, Analytics, Streaming, ...
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI: https://doi.org/10.1109/ICDE.2005.1
… and our new database is …
🎉
InfluxDB IOx - The Future Core of InfluxDB Built
with Rust and Arrow
Analytic Systems (vs Transactional)
● Transactional (OLTP, key-value stores, etc.)
○ Workload: “look up a record by id”, “update a record”, “keep data durable and consistent”
○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, etc.
● Analytic (OLAP, “Big Data”, etc.)
○ Workload: aggregate many rows to get a historical view, bulk loads, rarely updated
○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc.
⇒ The rest of the talk focuses on analytic databases
So, you want to build a new database… ?
Databases need many features just to look like a database:
● Get Data In and Out
● Store Data and Catalog / Metadata
● Query Store: + Query Language
● Connect: Client API
…
Before you can invest in what makes your database special
Implementation timeline for a new Database system
Milestone captions on the timeline:
● “Let's Build a Database” 🤔
● “Ok now this is pretty good” 😐
● “Look mom! I have a database!” 😃
Features placed along the timeline:
● Client API
● In-memory storage
● In-memory filter + aggregation
● Durability / persistence
● Metadata Catalog + Management
● Query Language Parser
● Optimized / Compressed storage
● Execution on Compressed Data
● Joins!
● Additional Client Languages
● Outer Joins
● Subquery support
● More advanced analytics
● Cost based optimizer
● Out of core algorithms
● Storage Rearrangement
● Heuristic Query Planner
● Arithmetic expressions
● Date / time Expressions
● Concurrency Control
● Data Model / Type System
● Distributed query execution
● Resource Management
● Online recovery
Arrow Project Goals
“Build a better open source
foundation for data science”
🤔 How is this related to databases?
https://arrow.apache.org/
Arrow == toolkit for a modern analytic database
match tool_needed {
    File Format (persistence)         => Parquet
    Columnar memory representation    => Arrow Arrays
    Operations (e.g. add, multiply)   => Compute Kernels
    Network transfer                  => Arrow Flight IPC
    _                                 => ... to be continued ...
}
InfluxDB line protocol
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
weather,location=us-east temperature=83,humidity=69 1465839830200400200
weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
weather,location=us-west temperature=72,humidity=56 1465839830200400200
weather,location=us-east temperature=84,humidity=67 1465839830300400200
weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
weather,location=us-west temperature=71,humidity=57 1465839830400400200
Each line is made up of: Measurement, Tags, Fields, Timestamp
(see the InfluxDB Line Protocol Tutorial)
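To make the four parts of a line concrete, here is a minimal plain-Rust sketch that splits one line into measurement, tags, fields and timestamp. It is not the IOx parser: `Point`, `parse_pairs` and `parse_line` are hypothetical names, and the sketch assumes no escaping, no quoted strings, and numeric field values only.

```rust
/// One parsed line of (simplified) line protocol.
#[derive(Debug, PartialEq)]
struct Point {
    measurement: String,
    tags: Vec<(String, String)>,
    fields: Vec<(String, f64)>,
    timestamp_ns: i64,
}

/// Splits "k=v,k=v" into (key, value) pairs; None if any pair lacks '='.
fn parse_pairs(s: &str) -> Option<Vec<(&str, &str)>> {
    s.split(',').map(|kv| kv.split_once('=')).collect()
}

/// Parses `measurement[,tag=value...] field=value[,...] timestamp`.
/// Assumption: exactly three space-separated sections, no escaping.
fn parse_line(line: &str) -> Option<Point> {
    let mut parts = line.split(' ');
    let (series, fields, ts) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() {
        return None; // unexpected extra section
    }

    // The first section is the measurement name, optionally followed by tags.
    let (measurement, tags) = match series.split_once(',') {
        Some((m, t)) => (m, parse_pairs(t)?),
        None => (series, Vec::new()),
    };

    // Field values are parsed as f64 in this sketch.
    let mut parsed_fields = Vec::new();
    for (k, v) in parse_pairs(fields)? {
        parsed_fields.push((k.to_string(), v.parse::<f64>().ok()?));
    }

    Some(Point {
        measurement: measurement.to_string(),
        tags: tags
            .into_iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
        fields: parsed_fields,
        timestamp_ns: ts.parse().ok()?,
    })
}
```

Calling `parse_line` on the first sample line yields measurement `weather`, one tag (`location`), two fields (`temperature`, `humidity`) and the nanosecond timestamp.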
IOx Data Model
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
weather,location=us-east temperature=83,humidity=69 1465839830200400200
weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
weather,location=us-west temperature=72,humidity=56 1465839830200400200
weather,location=us-east temperature=84,humidity=67 1465839830300400200
weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
weather,location=us-west temperature=71,humidity=57 1465839830400400200
Table: weather
location      temperature  humidity  timestamp
"us-east"     82           67        2016-06-13T17:43:50.1004002Z
"us-midwest"  82           65        2016-06-13T17:43:50.1004002Z
"us-west"     70           54        2016-06-13T17:43:50.1004002Z
"us-east"     83           69        2016-06-13T17:43:50.2004002Z
"us-midwest"  87           78        2016-06-13T17:43:50.2004002Z
"us-west"     72           56        2016-06-13T17:43:50.2004002Z
"us-east"     84           67        2016-06-13T17:43:50.3004002Z
"us-midwest"  90           82        2016-06-13T17:43:50.4004002Z
"us-west"     71           57        2016-06-13T17:43:50.4004002Z
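The table is stored column by column rather than row by row. As a rough illustration of what that means, the following plain-Rust sketch turns rows into one vector per column and computes an aggregate that reads a single column. `Row`, `WeatherColumns`, `to_columns` and `mean_temperature` are hypothetical names for this sketch, not IOx or Arrow types.

```rust
/// One row of the `weather` table (row-oriented form).
struct Row {
    location: &'static str,
    temperature: f64,
    humidity: f64,
    timestamp_ns: i64,
}

/// Column-oriented form: one contiguous vector per column.
#[derive(Default)]
struct WeatherColumns {
    location: Vec<&'static str>,
    temperature: Vec<f64>,
    humidity: Vec<f64>,
    timestamp_ns: Vec<i64>,
}

/// Pivots rows into columns; row i ends up at index i of every column.
fn to_columns(rows: &[Row]) -> WeatherColumns {
    let mut cols = WeatherColumns::default();
    for r in rows {
        cols.location.push(r.location);
        cols.temperature.push(r.temperature);
        cols.humidity.push(r.humidity);
        cols.timestamp_ns.push(r.timestamp_ns);
    }
    cols
}

/// An aggregate over one column only reads that column's memory.
fn mean_temperature(cols: &WeatherColumns) -> Option<f64> {
    if cols.temperature.is_empty() {
        return None;
    }
    Some(cols.temperature.iter().sum::<f64>() / cols.temperature.len() as f64)
}
```

The design point is that an analytic query such as "average temperature" scans one dense `Vec<f64>` and never touches the location strings or timestamps.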
Code Examples
Thesis: “When writing an analytic database, you will end up implementing the Arrow feature set”
(Ecosystem integration is another major benefit of Arrow, subject of a future talk)
Compare plain Rust and Rust using the Arrow library
* Take performance comparisons with a large grain of salt
Motivating Example
“Find the rows that are not in `us-west`”
Create the Array

Plain Rust:

    let string_vec: Vec<String> = (0..NUM_TAGS)
        .map(|i| {
            match i % 3 {
                0 => "us-east",
                1 => "us-midwest",
                _ => "us-west", // wildcard arm keeps the match exhaustive
            }
            .into()
        })
        .collect();

    > created array with 10000000 elements
    ~600ms

Rust + Arrow:

    let mut builder = StringBuilder::new(NUM_TAGS);
    (0..NUM_TAGS).for_each(|i| {
        let location = match i % 3 {
            0 => "us-east",
            1 => "us-midwest",
            _ => "us-west", // wildcard arm keeps the match exhaustive
        };
        builder.append_value(location).unwrap()
    });
    let array = builder.finish();

    > created array with 10000000 elements
    ~400ms
Memory Footprint

Plain Rust:

    use std::mem::size_of;

    // Vec header + one String header per element + the string bytes
    let size = size_of::<Vec<String>>()
        + string_vec
            .iter()
            .fold(0, |sz, s| sz + size_of::<String>() + s.len());
    println!("total size: {} bytes", size);

    > total size: 320000023 bytes
    ~320 MB *

Rust + Arrow:

    println!("total size: {} bytes", array.get_array_memory_size());

    > total size: 149206128 bytes
    ~150 MB
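Much of the difference comes from layout: a `Vec<String>` pays a three-word `String` header and a separate heap allocation per value, while Arrow's string array stores all bytes in one buffer plus one 32-bit offset per value. The plain-Rust sketch below mimics that layout to show the accounting. `StringColumn`, `from_strs`, `payload_bytes` and `vec_string_bytes` are hypothetical names, and this is an illustration of the idea, not Arrow's implementation.

```rust
/// All string bytes in one buffer, plus len + 1 offsets into it.
struct StringColumn {
    offsets: Vec<i32>, // value i occupies values[offsets[i]..offsets[i + 1]]
    values: Vec<u8>,
}

impl StringColumn {
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = Vec::with_capacity(items.len() + 1);
        let mut values = Vec::new();
        offsets.push(0);
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        StringColumn { offsets, values }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn get(&self, i: usize) -> &str {
        let (start, end) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        std::str::from_utf8(&self.values[start..end]).expect("built from valid UTF-8")
    }

    /// Bytes of payload, ignoring any spare capacity in the buffers.
    fn payload_bytes(&self) -> usize {
        self.offsets.len() * std::mem::size_of::<i32>() + self.values.len()
    }
}

/// Same accounting as the plain Rust measurement on the slide.
fn vec_string_bytes(items: &[String]) -> usize {
    std::mem::size_of::<Vec<String>>()
        + items
            .iter()
            .map(|s| std::mem::size_of::<String>() + s.len())
            .sum::<usize>()
}
```

For the 10,000,000 values on the slide this layout would need roughly 80 MB of string bytes plus 40 MB of offsets. The ~150 MB the slide reports is larger, presumably because of spare buffer capacity, though that explanation is an assumption.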
Find Rows != “us-west”

Plain Rust:

    let not_west_bitset: Vec<bool> = string_vec
        .iter()
        .map(|s| s != "us-west")
        .collect();
    let num_not_west = not_west_bitset
        .iter()
        .filter(|&&v| v)
        .count();

    > Found 6666667 not in west
    ~50ms

Rust + Arrow:

    let not_west_bitset = neq_utf8_scalar(&array, "us-west").unwrap();
    let num_not_west = not_west_bitset
        .iter()
        .filter(|v| matches!(v, Some(true)))
        .count();

    > Found 6666667 not in west
    ~120ms
Find Rows != “us-west” (with null handling)

Plain Rust:

    let string_vec: Vec<Option<String>> = ...;
    let not_west_bitset: Vec<bool> = string_vec
        .iter()
        .map(|s| {
            s.as_ref()
                .map(|s| s != "us-west")
                .unwrap_or(false)
        })
        .collect();
    let num_not_west = not_west_bitset
        .iter()
        .filter(|&&v| v)
        .count();

Rust + Arrow:

    Same as previous

    > Found 6666667 not in west
    ~50ms
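The Arrow code does not change for nulls because Arrow arrays carry a validity bitmap next to the values instead of wrapping each value in an `Option`. The sketch below shows that idea in plain Rust. `NullableColumn` and its methods are hypothetical names for illustration and are not Arrow's API; treating a null as "not a match" mirrors the `unwrap_or(false)` in the plain Rust version.

```rust
/// Values plus a validity bitmap: bit i set means value i is non-null.
struct NullableColumn {
    values: Vec<String>, // null slots hold an empty placeholder
    validity: Vec<u8>,   // one bit per value, least-significant bit first
}

impl NullableColumn {
    fn from_options(items: &[Option<&str>]) -> Self {
        let mut validity = vec![0u8; (items.len() + 7) / 8];
        let mut values = Vec::with_capacity(items.len());
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(s) => {
                    validity[i / 8] |= 1 << (i % 8); // mark slot i as valid
                    values.push(s.to_string());
                }
                None => values.push(String::new()),
            }
        }
        NullableColumn { values, validity }
    }

    fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1 << (i % 8))) != 0
    }

    /// Counts non-null values that differ from `needle`; nulls never match.
    fn count_not_equal(&self, needle: &str) -> usize {
        (0..self.values.len())
            .filter(|&i| self.is_valid(i) && self.values[i] != needle)
            .count()
    }
}
```

With this layout the null check costs one bit per value, and kernels can combine validity bitmaps with bitwise operations rather than branching on every `Option`.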
Materialize rows for future processing

Plain Rust:

    let not_west: Vec<String> = not_west_bitset
        .iter()
        .enumerate()
        .filter_map(|(i, &v)| {
            if v {
                Some(string_vec[i].clone())
            } else {
                None
            }
        })
        .collect();

    > Made array of 6666667 Strings not in west
    ~450 ms

Rust + Arrow:

    let not_west = filter(&array, &not_west_bitset).unwrap();

    > Made array of 6666667 Strings not in west
    ~50 ms
More efficient encoding (dictionary)

Rust + Arrow:

    // builder capacity arguments are elided, as on the original slide
    let vb = StringBuilder::new();
    let kb = Int8Builder::new();
    // keys builder first, then values builder
    let mut builder = StringDictionaryBuilder::new(kb, vb);
    (0..NUM_TAGS).for_each(|i| {
        let location = match i % 3 {
            0 => "us-east",
            1 => "us-midwest",
            _ => "us-west", // wildcard arm keeps the match exhaustive
        };
        builder.append(location).unwrap();
    });
    let array = builder.finish();

    > total size: 10000688 bytes
    10MB
    250 ms

Dictionary layout:

    dictionary
    [0] "us-east"
    [1] "us-midwest"
    [2] "us-west"

    Location (keys, [u8])
    0 1 2 0 1 2 0 1 2
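The encoding itself is simple enough to sketch in plain Rust: keep each distinct string once and store a small integer key per row. `DictColumn`, `dictionary_encode` and `decode` below are hypothetical names, and this is an illustration of dictionary encoding in general rather than Arrow's `StringDictionaryBuilder`.

```rust
use std::collections::HashMap;

/// Dictionary-encoded string column.
struct DictColumn {
    dictionary: Vec<String>, // distinct values, in first-seen order
    keys: Vec<i8>,           // one key per row, indexing into `dictionary`
}

/// Encodes `items`; returns None if the distinct values do not fit in i8 keys.
fn dictionary_encode(items: &[&str]) -> Option<DictColumn> {
    let mut index: HashMap<&str, i8> = HashMap::new();
    let mut dictionary: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(items.len());
    for &s in items {
        let key = match index.get(s) {
            Some(&k) => k,
            None => {
                if dictionary.len() > i8::MAX as usize {
                    return None; // key type too small for this many values
                }
                let k = dictionary.len() as i8;
                dictionary.push(s.to_string());
                index.insert(s, k);
                k
            }
        };
        keys.push(key);
    }
    Some(DictColumn { dictionary, keys })
}

/// Looks up the string for one row.
fn decode(col: &DictColumn, row: usize) -> &str {
    &col.dictionary[col.keys[row] as usize]
}
```

With three distinct locations, each row costs one byte for its key, which is consistent with the roughly 10 MB reported for 10,000,000 rows.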
SIMD Anyone?

    let output = gt(&left, &right).unwrap();

Element-wise comparison (left > right), several elements per instruction:

    left  right  output
    10    2      1
    20    33     0
    17    2      1
    5     1      1
    23    6      1
    5     7      0
    9     8      1
    12    2      1
    4     7      0
    5     2      1
    76    5      1
    2     6      0
    3     7      0
    5     8      0
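The left / right / output example can be reproduced with a small plain-Rust sketch that processes the inputs in fixed-size chunks and packs one result bit per element. This is scalar code written in a lane-at-a-time shape, not explicit SIMD and not Arrow's `gt` kernel; `LANES`, `gt_bitmask` and `bit` are hypothetical names, and the lane count of 8 is an arbitrary choice for the illustration.

```rust
/// Hypothetical lane count; a real SIMD width depends on the type and CPU.
const LANES: usize = 8;

/// Computes `left[i] > right[i]` for every i, packed one bit per element
/// (least-significant bit first). Unused bits in the last byte stay zero.
fn gt_bitmask(left: &[i32], right: &[i32]) -> Vec<u8> {
    assert_eq!(left.len(), right.len(), "inputs must have equal length");
    let mut out = Vec::with_capacity((left.len() + 7) / 8);
    // Each chunk of up to LANES elements produces exactly one output byte.
    for (l, r) in left.chunks(LANES).zip(right.chunks(LANES)) {
        let mut byte = 0u8;
        for (lane, (a, b)) in l.iter().zip(r).enumerate() {
            byte |= ((a > b) as u8) << lane;
        }
        out.push(byte);
    }
    out
}

/// Reads bit i of a packed mask.
fn bit(mask: &[u8], i: usize) -> bool {
    ((mask[i / 8] >> (i % 8)) & 1) == 1
}
```

Packing the result as bits matches how a boolean array can be stored, so the output can feed a later filter step without conversion.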
SIMD Implementation
Source: arrow/src/compute/kernels/comparison.rs

    #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), feature = "simd"))]
    fn simd_compare_op<T, F>(
        left: &PrimitiveArray<T>,
        right: &PrimitiveArray<T>,
        op: F,
    ) -> Result<BooleanArray>
    where
        T: ArrowNumericType,
        F: Fn(T::Simd, T::Simd) -> T::SimdMask,
    {
        // use / error checking elided
        let null_bit_buffer = combine_option_bitmap(
            left.data_ref(), right.data_ref(), len
        )?;
        let lanes = T::lanes();
        let mut result = MutableBuffer::new(
            left.len() * mem::size_of::<bool>()
        );
        let rem = len % lanes;
        for i in (0..len - rem).step_by(lanes) {
            let simd_left = T::load(left.value_slice(i, lanes));
            let simd_right = T::load(right.value_slice(i, lanes));
            let simd_result = op(simd_left, simd_right);
            T::bitmask(&simd_result, |b| {
                result.write(b).unwrap();
            });
        }
        if rem > 0 {
            let simd_left = T::load(left.value_slice(len - rem, lanes));
            let simd_right = T::load(right.value_slice(len - rem, lanes));
            let simd_result = op(simd_left, simd_right);
            let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize;
            T::bitmask(&simd_result, |b| {
                result.write(&b[0..rem_buffer_size]).unwrap();
            });
        }
        let data = ArrayData::new(
            DataType::Boolean,
            left.len(),
            None,
            null_bit_buffer,
            0,
            vec![result.freeze()],
            vec![],
        );
        Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data)))
    }
Other things needed in a database
Vec<Option<String>> to support nulls
Handle other data types with same code
Vectorized implementations of filter, aggregate, etc
Persist it to storage
Send data over the network
Ecosystem compatibility
...
Rust / Arrow Community: Good and Getting better
Major Roadmap Items (see also Apache Arrow (Rust) 2.0.0)
1. Support Stable Rust
2. Improved DictionaryArray support and performance
3. Improved compute kernel performance
4. SQL: Joins
5. Parallel CPU-bound operations; Additional platform support (e.g. ARMv8)
InfluxData specifically is investing in:
1. Flight IPC
2. Improved Dictionary and Date/Time support
3. DataFusion (some other tech talk)
Thank You
Find us online
GitHub: https://github.com/influxdata/influxdb_iox
Slack: https://influxdata.com/slack
It is early days; there are many cool things left to implement
And we are hiring (Senior IOx Engineer Job Posting)
