Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
@wesmckinn
The Need for Speed
21 Years Ago...
A semi-revisionist history of Big Data
Cloudera’s Jeff Hammerbacher
“Data first, ask questions later”
Enable everyone to “party on the data” (paraphrasing!)
Decoupling Storage from Processing
Simplified MapReduce Arch
Storage
Step 1 Step 2 Step 3 ...
“Scalability! But at what COST?”
McSherry, Isard, Murray 2015
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
COST: Configuration that Outperforms a Single Thread
Brute Force Scalability
Storage Feasibility
Computational Feasibility
Resource Utilization
Interactivity
Up the hierarchy of needs...
Rise of “End-to-end” Execution Engines
Storage
Step 1 Step 2 Step 3 ...
INPUT OUTPUT
End-to-end engines: drawbacks
• Examples: SQL on Hadoop systems, Apache Spark, others
• Serve some use cases well, others less well
• Fall short in ML/AI domain
The Interoperability Conundrum
Engine A Engine B
Data Handoff
Some Hardware Trends
• Manycore processor architectures
• Much faster disk
• Much faster networking
• Beyond CPUs
“Why Modern CPUs Are Starving and What Can Be Done about It”
Francesc Alted, IEEE 2010
Recognizing Serialization as an Enemy
Serialization
Translation of data into a form that can be stored or transmitted, and reconstructed later
How to Eliminate Serialization
“Serialized” and In-Memory Format must be the same (or nearly so)
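The principle can be illustrated with a minimal Python sketch that uses only the standard library, not Arrow itself. It assumes a single column of 64-bit integers in native byte order; the names `column`, `wire_bytes`, and `view` are illustrative. When the bytes that are stored or sent match the in-memory layout, the receiver can read values in place, whereas a text encoding such as JSON has to be parsed into new objects first.

```python
import json
from array import array

# A column of 64-bit signed integers, stored contiguously in memory.
column = array("q", [10, 20, 30, 40])

# "Serialized" form identical to the in-memory form: just the raw buffer.
wire_bytes = column.tobytes()

# The receiver reinterprets the bytes in place: no per-value parsing.
view = memoryview(wire_bytes).cast("q")
total_in_place = sum(view)

# Contrast: a text encoding must be parsed back into new Python objects.
json_text = json.dumps(column.tolist())
total_parsed = sum(json.loads(json_text))
```

A real format also has to pin down byte order, alignment, nulls, and schema metadata, which this sketch ignores.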
A Collective Realization in 2015
Many open source developers had noted the absence of an in-memory standard for structured data analytics
Mission
• Language-agnostic in-memory format for analytical query processing on modern hardware
• Low-overhead data sharing and transport
• A cross-language development platform to build Arrow-powered applications
Why Column-oriented?
• Reduce unnecessary IO
• Increase memory throughput
• Better parallelism
• Leverage SIMD instructions
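As a rough illustration of the first two bullets, the following Python sketch (standard library only, with made-up field names and values) contrasts a row-oriented layout with one contiguous typed buffer per column. A query over a single field only needs to scan that field's buffer. Parallelism and SIMD are not demonstrated here; pure Python cannot show them directly.

```python
from array import array

# Row-oriented: each record carries every field together.
rows = [
    {"id": 1, "price": 9.5, "qty": 2},
    {"id": 2, "price": 3.0, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 4},
]

# Column-oriented: one contiguous, typed buffer per field.
columns = {
    "id": array("q", (r["id"] for r in rows)),
    "price": array("d", (r["price"] for r in rows)),
    "qty": array("q", (r["qty"] for r in rows)),
}

# A query over one field scans only that field's buffer.
total_qty = sum(columns["qty"])

# Bytes scanned for the query versus bytes held across all columns.
bytes_scanned = columns["qty"].itemsize * len(columns["qty"])
bytes_total = sum(c.itemsize * len(c) for c in columns.values())
```

With three 8-byte fields, the single-column scan reads one third of the data that a full row scan would.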
Apache Arrow “meta” goals
• Forge collaborations between database systems and data science / ML / AI communities
• Eliminate barriers to code sharing between application ecosystems and programming languages
Community over Code
• ASF open governance model
• ~400 unique contributors
• 49 committers, 28 PMC members
• 11 programming languages represented
Arrow Development in Practice
• “Core” format and protocol implementations
• “Batteries-included” standard libraries
• Common build / test / package infrastructure and compatibility testing
Language Relationships
[Diagram: C++, Java, Go, Rust, C#, JavaScript, C, Ruby, Python, R, MATLAB]
Some Arrow Success Stories
Arrow Flight: Fast Data Services
• gRPC-based framework for custom data services
• High-speed network dataset transfer
• Now available for C++, Java, Python
Flight key ideas
• Zero-serialization
• Bidirectional streaming transfers
• Parallel transfers + horizontal scalability designed into the protocol
• Reap benefits of Google’s work on gRPC
Flight use cases
• Replacing slow database protocols like JDBC / ODBC
• General network data movement
• Retrofit legacy systems with fast Arrow IO
Notable Arrow subcomponents
• Rust DataFusion: Arrow-native Rust query engine
• Gandiva: LLVM analytical expression compiler
• Plasma: Shared memory object store
Funding Arrow Development
• Apache projects are technically communities of volunteers
• Much development contributed by direct users of Arrow
• Ursa Labs: not-for-profit group I founded in 2018 with initial support of RStudio and Two Sigma
Ursa Labs Sponsors
Thank you!
https://arrow.apache.org
https://ursalabs.org
