Data Science Without Borders (JupyterCon 2017)

Wes McKinney @wesmckinn
DATA SCIENCE WITHOUT BORDERS
WES MCKINNEY @WESMCKINN
JupyterCon | August 2017

IMPORTANT LEGAL INFORMATION
• The information presented here is offered for informational purposes only and should not be used for any other purpose
(including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes
only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any
offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of
Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at
any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such
copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright
or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma,
nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved

THINKING ON THE LAST 10 YEARS
2007 2017

A shared front-end
for data science

THE NEXT 10 YEARS AND BEYOND
2017 2027 …

THE AI ARMS RACE

CHANGING HARDWARE LANDSCAPE
DISK
PROCESSING
MEMORY

DATA SCIENCE LANGUAGE “SILOS”
FRONT-END
PYTHON R JVM JULIA …

WHAT’S IN A SILO?
STORAGE / DATA ACCESS
DATA STRUCTURES / IN-MEMORY FORMATS
GENERAL COMPUTE ENGINE(S)
ADVANCED ANALYTICS

WHAT’S IN A SILO?
STORAGE / DATA ACCESS
DATA STRUCTURES / IN-MEMORY FORMATS
GENERAL COMPUTE ENGINE(S)
ADVANCED ANALYTICS
pandas NumPy
pandas
NumPy
pandas
scikit-learn

RENOVATING PANDAS

MAKING THE SILOS “SMALLER”
FRONT-END
PYTHON R JVM JULIA
?
…

PROGRAMMING LANGUAGES
AS USER INTERFACES

GRAPHIC: Iceberg under the sea (only the top part is visible to the naked eye)

SAME ANALYSIS, DIFFERENT IMPLEMENTATION
R
df <- read_csv(…)
df %>% group_by(…) %>% summarise(…)
PYTHON
df = read_csv(…)
df.groupby(…).aggregate(…)
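
To make the Python half of the slide concrete, here is a minimal sketch of the read, group, and aggregate pattern using pandas. The CSV contents and the column names `city` and `sales` are made up for illustration, since the slide elides its arguments; the R pipeline on the slide expresses the same analysis through a different implementation.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the elided read_csv(…) argument.
CSV_TEXT = """city,sales
Austin,10
Boston,5
Austin,7
Boston,3
"""

# Parse the CSV text into a data frame.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Group rows by a key column and reduce each group to one value,
# the pandas counterpart of group_by(…) %>% summarise(…) in R.
totals = df.groupby("city")["sales"].aggregate("sum")

print(totals.to_dict())
```

The user-facing code differs per language, but the work underneath (parsing, hashing keys, summing) is the same, which is the slide's argument for sharing that layer.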

A SHARED RUNTIME FOR DATA SCIENCE
FRONT-END
PYTHON R JVM JULIA
SHARED DATA SCIENCE RUNTIME
…

PART 1: STANDARD IN-MEMORY FORMAT
R
PYTHON
JVM
PORTABLE DATA FRAME
Non-Portable Data Frames
…
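
As a rough illustration of what a language-neutral in-memory column might look like, the sketch below packs integers into one contiguous value buffer plus a validity bitmap that marks missing entries. This is a simplified, standard-library-only approximation in the spirit of a columnar layout, not the actual Arrow implementation; the helper names `build_int_column` and `get_value` are hypothetical.

```python
from array import array


def build_int_column(values):
    """Pack ints (None = missing) into a value buffer and a validity bitmap."""
    # Contiguous 64-bit signed integers; missing slots hold a placeholder 0.
    data = array("q", (0 if v is None else v for v in values))
    # One bit per value, rounded up to whole bytes; a set bit means "valid".
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return data, bitmap


def get_value(data, bitmap, i):
    """Return the i-th value, or None when its validity bit is unset."""
    is_valid = (bitmap[i // 8] >> (i % 8)) & 1
    return data[i] if is_valid else None


data, bitmap = build_int_column([1, None, 3])
print([get_value(data, bitmap, i) for i in range(3)])
```

Because the layout is just bytes with a fixed interpretation, any language that agrees on that interpretation could read the column without a language-specific data frame object.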

PART 2: ZERO COPY INTERCHANGE
R PYTHON JVM
SHARED MEMORY + STANDARD MEMORY FORMATS
…
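
The idea of zero-copy interchange can be sketched inside a single Python process with `memoryview`, which wraps an existing buffer without copying it. This is only an in-process analogy under simple assumptions: sharing across languages or processes additionally needs shared memory and an agreed memory format, as the slide states. The variable names are illustrative.

```python
from array import array

# One producer-owned buffer of 64-bit floats.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Two consumers wrap the same memory; memoryview does not copy the data.
view_a = memoryview(column)
view_b = memoryview(column)[1:3]  # slicing a memoryview is also copy-free

# A write through one view is visible through the other view
# and through the original buffer, because all three share memory.
view_a[1] = 20.0

print(view_b[0], column[1])
```

If each consumer instead received its own copy, the write would not propagate and the transfer cost would grow with the size of the data.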

PART 3: HIGH PERFORMANCE DATA ACCESS
BINARY
COLUMNAR
CSV
SQL
PORTABLE DATA FRAME
Storage Formats / Databases
…

PART 4: FLEXIBLE COMPUTATION ENGINE
• Zero-overhead User-defined Functions
• Portable Operator “Graphs”
• “Embeddable” in Larger Systems

APACHE ARROW
Language-agnostic Data Frame Format
Zero-Copy Interchange

BUILDING THE ARROW FORMAT
• “Superset” of representations supported by R, pandas, SQL engines
• Optimized for CPU cache affinity
• ASF Governance: Open + Transparent Community Project

FEATHER: MINIMALIST ARROW ON DISK

Some Arrow OSS Users
Feather Format
Ray Project

THANK YOU
WES MCKINNEY @WESMCKINN
Apache Arrow: http://arrow.apache.org
